Exploring the perceived harshness of cello sounds by morphingand synthesis techniques
Jocelyn Roz�e,a) Mitsuko Aramaki, Richard Kronland-Martinet, and Sølvi YstadAix Marseille Univ, CNRS, PRISM (Perception, Representation, Image, Sound, Music), 31 Chemin J. Aiguier,13402 Marseille Cedex 20, France
(Received 7 June 2016; revised 23 February 2017; accepted 26 February 2017; published online 24March 2017)
Cello bowing requires a very fine control of the musicians’ gestures to ensure the quality of the per-
ceived sound. When the interaction between the bow hair and the string is optimal, the sound is per-
ceived as broad and round. On the other hand, when the gestural control becomes more
approximate, the sound quality deteriorates and often becomes harsh, shrill, and quavering. In this
study, such a timbre degradation, often described by French cellists as harshness (d�echarnement), is
investigated from both signal and perceptual perspectives. Harsh sounds were obtained from experi-
enced cellists subjected to a postural constraint. A signal approach based on Gabor masks enabled
us to capture the main dissimilarities between round and harsh sounds. Two complementary methods
perceptually validated these signal features: First, a predictive regression model of the perceived
harshness was built from sound continua obtained by a morphing technique. Next, the signal struc-
tures identified by the model were validated within a perceptual timbre space, obtained by multidi-
mensional scaling analysis on pairs of synthesized stimuli controlled in harshness. The results
revealed that the perceived harshness was due to a combination between a more chaotic harmonic
behavior, a formantic emergence, and a weaker attack slope. VC 2017 Acoustical Society of America.
[http://dx.doi.org/10.1121/1.4978522]
[TRM] Pages: 2121–2136
I. INTRODUCTION
Musical expressivity has been the subject of a large
number of studies for various instruments relating or not the
musicians’ body movements to the produced sounds. In the
case of cello playing, the particular musician–instrument
interaction, in close embrace type, emphasizes the implica-
tion of the corporeal aspects in the expressivity. We investi-
gate this topic in the present study by focusing on a special
acoustic feature of cello sounds, linked to a degradation of
timbre quality, and frequently referred to as harshness
(d�echarnement in French) by cellist teachers. A harsh sound
(son d�echarn�e) is recognized among bowed string players as
a sound in which timbre is impoverished, inducing a sensa-
tion of shrill, whistling, and loss in sound consistency.
Harsh notes are very frequent among young cellist stu-
dents and often arise from an inappropriate bowing gesture
(Guettler, 2002) and/or a bad postural coordination. Indeed,
many studies have demonstrated that, in addition to instru-
mental gestures directly responsible for the sound produc-
tion, musical interpretation is strongly influenced by
accompanying (or ancillary) gestures, i.e., postural move-
ments that are rarely learnt in music schools. Their musical
significance was explored for the clarinet (Wanderley et al.,2005; Desmet et al., 2012), the piano (Thompson and Luck,
2012), the harp (Chadefaux et al., 2013), and the violin (Van
Zijl and Luck, 2013). For example, Wanderley et al. (2005)
demonstrated that clarinetists might transcribe the phrasing
and rhythmical elements of a musical score through arm
flapping or waist and knee bending. In the same line of
thinking, we were interested in exploring embodied music
cognition of cellists, and how perceived musical features
might result from their gestures and postural synergies.
Leman (2008) introduced this notion of corporeal encoding
as an interconnection between two learning mechanisms: a
high-level reenacting mechanism, called action-perception
loop, able to build a complex sequence of gestures (right
bowing-arm positioning, left hand fingerings, etc.) in order
to reach a local musical target, and a low-level mechanism,
called sensorimotor loop, which continuously adjusts the
musician’s motor program (bow force, bow velocity, pos-
tural variables, etc.) to maintain the control of the sound pro-
duction. In Roz�e et al. (2016), we designed an experiment
with these former studies in mind, by assessing the effects of
different kinds of postural constraints on the cellist’s musical
expressivity along musical phrases. To ensure that expres-
sive degradations were a consequence of postural constraints
and not of technical weaknesses, the musicians that took part
in the experiment were all professional or much experi-
mented players. This experiment turned out to be particu-
larly revealing, since in some difficult musical passages, the
postural constraint affected the cellist’s bowing gesture in a
manner which might produce timbre degradations resulting
in sounds perceived as harsh. The aim of the present study is
to investigate this perceptual harshness in terms of signal
properties.
Previous research related to the perception of harsh
acoustic features for bowed-string instruments can be found
in Stepanek and Otcenasek (2005), Stepanek (2006), and
more recently in Fritz et al. (2012). These perceptual studiesa)Electronic mail: [email protected]
J. Acoust. Soc. Am. 141 (3), March 2017 VC 2017 Acoustical Society of America 21210001-4966/2017/141(3)/2121/16/$30.00
aimed at correlating acoustic properties of violin sounds to a
list of verbal attributes frequently used to characterize musi-
cal timbre. Multidimensional scaling (MDS) analysis was
performed on the dissimilarity scores between pairs of stim-
uli in order to construct a perceptual space of verbal attrib-
utes, likely to explain the acoustic correlates along each one
of its dimensions. For example, in the article by Fritz et al.(2012), the main dimension of the space assessing the
“overall sound quality” corresponds to a spectral balance
between adjectives of desirable sounds (warm, rich, mellow)
and undesirable ones (metallic, cold, harsh). To correlate the
amount of harshness perceived along this dimension with
physical properties, the authors created synthetic violin
sounds by modifying the energy levels in five-octave spec-
tral bands, and performed dissimilarity tests on pairs of these
sounds. They demonstrated that the undesirable perceptual
effect of harshness corresponds to localized and excessive
high-frequency energy, combining most of the time with
poor low-frequency content. Thus, the harshness phenome-
non primarily seems to be linked to a spectral energy transfer
toward upper partials and indirectly to the notion of sound
brightness (brillance). Roz�e et al. (2016) characterized this
spectral transfer more finely by designing a relative bright-
ness descriptor related to the three frequency bands of the tri-
stimulus criterion (Pollard and Jansson, 1982). Numerous
studies (Grey and Gordon, 1978; McAdams, 1999; Caclin
et al., 2005) highlighted a strong correlation between bright-
ness and spectral centroid, i.e., the center of gravity of the
spectral energy distribution. According to Barthet et al.(2008) and Barthet et al. (2010), the brightness might also
play an important role in the tension perceived within clari-
net interpretation between a muffled tone judged as soft and
a bright, more aggressive one. Analyses of cello sounds by
Chudy et al. (2013) confirmed that boosted higher spectral
harmonics would cause a tone to be more “tensed.” As a
consequence, sounds produced in our expressive experimen-
tal context and perceived as harsh might also comply with an
increasingly aggressive and tense musical expression.
The aim of the present study is to better explain the
harshness phenomenon from an acoustical point of view,
based on certain assumptions linked to the way a postural
constraint alters the bowing gesture, and hereby induces this
sound degradation. The works of Demoucron (2008) or
Schoonderwaldt (2009) provide some cues from investiga-
tions on the relationship between the brightness and the
physical parameters of the bow–string interaction in the
case of the violin. They computed the brightness determined
by the spectral centroid values for different combinations
of physical control parameters (bow force, bow velocity,
bow-bridge distance), and projected them into Schelleng
diagrams (Schelleng, 1973). Such diagrams constitute a bi-
dimensional space defined by the bow force and the bow-
bridge distance, and display the variations of acoustical
parameters for a given bow velocity. Within Schelleng dia-
grams, spectral centroids of violin sounds increase with the
bow force and decrease with the bow velocity or the bow-
bridge distance. Guettler et al. (2003), Guettler (2003), and
Schoonderwaldt et al. (2003) also showed that lowering the
bow velocity or increasing the bow force resulted in higher
frequency partials. Hereby, we may assume that harsh
sounds, which are brighter than normal ones, result from
potentially stronger or slower bowing gestures, or smaller
bow-bridge distances (“sul ponticello”). Such sounds, there-
fore, might locate around the upper frontier of the Schelleng
diagram within the “Raucous” area, above which the normal
Helmholtz motion no longer exists.
In this paper, we are going to explore in depth the signal
characteristics composing the notion of harshness for cello
sounds, and validate it from a perceptual point of view by
adapted listening tests. We expect that this phenomenon
matches a complex multidimensional alteration affecting
spectral, but maybe also spectro-temporal dimensions. Thus,
we will first observe this sound quality degradation from a
macroscopical point of view before inferring on the spectral
and temporal descriptors of timbre that best characterize the
harshness. Then, we will use two complementary methods to
perceptually validate these descriptors. In the first method,
the aim is to build a perceptual model of harshness com-
posed of the most relevant signal descriptors. A morphing
technique based on Gabor tools is presented to build con-
tinua between round and harsh sounds. Then the amount of
perceived harshness is predicted by multilinear regression on
all the descriptors to identify the most relevant descriptors in
theory. In the second method, we assess the perceived sensa-
tion of harshness by controlling the envelopes of two rele-
vant descriptors of the predictive model. The process uses
cross-synthesis techniques and is perceptually validated by
the construction of a perceptual timbre space issued from
MDS analysis.
II. QUANTIFYING THE SOUND QUALITYDEGRADATION
In this part, we attempt to quantify the acoustic degrada-
tion of cello sounds, perceived as a harshness phenomenon,
through a macroscopical comparative analysis and finer
comparisons of suitably chosen acoustic descriptors. The
presented sound corpus was used in Roz�e et al. (2016) for
a preliminary and quantitative exploration of the harsh
degradation, based on a spectro-temporal duality between
two descriptors. In the present paper, our aim is to refine this
preliminary approach by proposing a full panel of signal
descriptors that might account for the harshness phenome-
non. An original approach based on Gabor masks is pro-
posed in order to assess the descriptors that best characterize
the timbre transformation between normal and harsh sounds.
A. Sound corpus
The sound corpus was created from data collected in the
experiment conducted by Roz�e et al. (2016). In the protocol
of this experiment, we asked seven cellists to play a score in
the most expressive way, while being subjected to four kinds
of postural conditions. For the scope of this paper, we only
deal with data collected in two opposite postural situations: a
normal one, where the cellists could play naturally as in a
performance context, and a physically fully constrained one,
in which they had to play with the torso attached to the back
of a chair by a safety race harness, and a neck collar limiting
2122 J. Acoust. Soc. Am. 141 (3), March 2017 Roz�e et al.
their head movements. All the participants were professional
or very experienced musicians to ensure that any loss of
expressivity was due to the physical immobilization and no
to potential weaknesses in their playing technique. They
were asked to perform the experimental score with the same
instrument at tempo ¼ 45 bpm, in two articulation variants:
detached (d�etach�e) and legato. By using the same cello, we
ensured that the physical characteristics of the instrument
would not alter subsequent timbre analyses.
For each postural condition, the musicians’ impressions
were collected through a questionnaire assessing their feel-
ings, specific difficulties that they may have encountered, as
well as the perceived influence of the postural condition on
their movements and expressive sound features. Most of them
reported a degradation of their usual sound palette within the
fully constrained postural situation, and especially the impres-
sion of producing “tighter, more tense sounds” (“sons plus�etriqu�es, tendus” in French), “lacking of depth and natural
resonances.” Some of them also evoked more “harsh, shrill,
and whistling notes” (“notes plus d�echarn�ees, criardes etsifflantes” in French), which were particularly common within
a specific musical passage of the score (Fig. 1), containing
more gestural and rhythmic difficulties. A thorough post-
examination of the sound recordings revealed that in spite of
the postural constraint, the cellists managed quite well to pro-
duce a proper Helmholtz motion, except in the specific musi-
cal passage described previously. According to the authors’
subjective judgments, several notes located in the first bar of
the passage were randomly affected by the harsh timbre deg-
radations. The first note of the bar, a dotted sixteenth of pitch
E3 (fundamental frequency 329.63 Hz) was most clearly and
regularly perceived as harsh. As a consequence, we chose this
note to create the sound corpus of the study and to perform
harsh timbre comparisons. It should be mentioned that this
note corresponds to a rhythmic difficulty requiring an excel-
lent coordination between the right and left arms, since it
occurs prior to a pushed bow stroke (pouss�e d’archet) com-
bined with a fast-shift (d�emanch�e rapide) of the left arm. We
may here suppose that the constraint altered this coordination
between the two instrumental gestures, which resulted in a
strong decrease in timbre quality for this note.
The sound corpus was obtained by extracting 20 of these
E3 notes from the recordings of all the cellists, whatever the
articulations d�etach�e or legato. The extraction process for
the E3 notes relied on a pitch-tracking algorithm adapted
from the MIR toolbox (Lartillot and Toiviainen, 2007).
More precisely, these 20 notes were composed of 10 pairs of
E3 notes, respectively, selected within the normal and con-
strained postural contexts for a given cellist and articulation.
A pair of notes associates one of the round/beautiful sounds
produced in the normal situation to one of the harsh/
degraded sounds produced in the constrained situation (for
the same cellist and articulation).
The selection of round/harsh pairs of notes was per-
formed by the authors through an informal listening test,
conducted on the audio recordings obtained after the experi-
ment. The authors used high quality loudspeakers and
headphones to detect significant harshness dissimilarities
between sounds in the normal and the posturally constrained
situation. Their musical expertise and knowledge of cello
practice enabled them to select the sounds that best reflected
the differences between the two postural situations. The rele-
vance of the authors’ judgment was further confirmed by
the results of the first perceptual experiment presented in
this paper [see Fig. 6(a), Sec. III E]. Sometimes, in the con-
strained situation, some cellists played the E3 without being
able to prevent the bow from slipping on the D open string,
which resulted in a sound that contained a second tone in the
onset phase. We systematically removed this artifact as far
as possible by means of a very selective notch filter centered
on the D2 frequency.
The average length of the notes composing this corpus
was 0.31 s, which is shorter than the theoretical duration of
dotted sixteenths (0.5 s at tempo ¼ 45 bpm). The standard
deviation of their durations was 0.06 s. This suggests that the
cellists consistently played the E3 notes more like sixteenths
(theoretical duration 0.33 s at ¼ 45 bpm) than like dotted
sixteenths. Timbre discrimination of such short stimuli might
not be obvious. However, their durations remained higher
than the minimum duration required to recognize the timbre
of a sound (around 100 ms in Schaeffer, 1966). The authors
thus judged the task feasible, especially since the partici-
pants were cellists who are trained to constantly pay atten-
tion to the quality of their sounds.
B. Gabor tools
In the present study, Gabor transforms were used to
reveal differences between pairs of round/harsh sounds com-
posing the corpus. Gabor transforms, or short-time Fourier
transforms (STFTs) with Gaussian analysis windows, are
useful tools to get a macroscopic view of sound signal prop-
erties. They make it possible to represent the timbre of a
sound as an image in the time-frequency space, defined on a
suitably discretized time-frequency lattice. If a,b> 0 are
time and frequency sampling constants, and L denotes the
length of the analyzed signal, the time-frequency lattice of
its Gabor transform is characterized by M, the number of
FIG. 1. Bar of the score selected to investigate the harshness. The analyzed
note (E3) has been circled for each articulation variant: (a) Detached(D�etach�e) playing mode, (b) Legato playing mode.
J. Acoust. Soc. Am. 141 (3), March 2017 Roz�e et al. 2123
frequency channels, and N, the number of time steps, such
that L¼Mb¼Na (LTFAT toolbox of Søndergaard et al.,2012). For a signal x 2 C
L, the transform can be written as
coefficients cm;n 2 CM�N
of the signal expansion in a family
of Gabor atoms gm;n ¼ e2ipmb l�nað Þg l� nað Þ,
cm;n ¼ hx; gm;ni ¼XL�1
l¼1
x lð Þe�2ipmb l�nað Þg l� nað Þ; (1)
with n 2 0;N � 1½ �;m 2 0;M � 1½ �. l is a discrete time vari-
able along the length L of the signal, and a, b> 0 are the
time and frequency sampling constants, respectively.
Differences between Gabor transforms for round and harsh
sounds can further be observed by computing their time-
frequency transfer function. This function is called a Gabor
mask m, and can be evaluated using a minimization process
based on the least-square criterion (Depalle et al., 2006)
W mð Þ ¼ kcm;n x2ð Þ �Mmcm;n x1ð Þk2 þ kkm� 1k2; (2)
where k 2 Rþ is the Lagrange parameter introduced to con-
trol the norm of m.
For the source and target signals, respectively, denoted
x1; x2 2 CL, the mask m is estimated by considering that
there is no difference between the signals when m¼ 1,
m ¼kcm;n x1ð Þkkcm;n x2ð Þk þ k
kcm;n x1ð Þk2 þ k¼ C1C2 þ k
C21 þ k
; (3)
where C1 and C2 are the modulus of the Gabor transforms of
the signals x1 and x2, respectively. As explained by Olivero
et al. (2010) and Sciabica et al. (2012), k plays the role of a
regularization parameter, which enables us to reveal more or
less detailed differences between signals.
C. Highlighting overall features of harshness
In this section we explain how to compute and tune a
Gabor mask in order to capture the salient dissimilarities of tim-
bre between the round and harsh versions of the note E3. One
sound pair was chosen among the ten pairs of the corpus (Roz�e,
2017). First of all, we matched the two signals of the pair in the
time-frequency domain. In the frequency space, this operation
was intrinsically achieved because the pitches to compare were
identical. In the temporal space, however, we needed to syn-
chronize the signal durations with an accurate time-scaling pro-
cess. This implied changing the signal durations without
affecting their pitch-contour and the time-evolution of the for-
mantic structure. The procedure was achieved using AdobeAudition software (San Jos�e, CA) by reducing the duration of
the longest signal so that it fitted with the shortest.
To capture global spectral tendencies within the Gabor
mask, a sliding window size of M¼ 2048 bins was chosen.
This corresponds to a lattice frequency resolution of �20 Hz.
From Eq. (3), we controlled the Gabor mask estimation
with different values of the regularization parameter k to get
the roughest (k� 1) or finest (k! 0) differences between
the signals. The time-frequency representation of the mask is
displayed in Fig. 2 for k¼ 1, and can be interpreted as an
overview of the harshnessing process of a round sound. In
Fig. 2, we can notice that black and white parts of the image
mainly coincide with the frequency bands situated at multi-
ple integers of the fundamental frequency of the note E3,
which is logical since the analyzed sounds are instrumental
and harmonic by nature. More interestingly, we observe as a
whole that the harsh sound (relatively to the round one)
presents strong energy reinforcements (white bands) in the
middle-upper spectrum area, and strong energy decreases
(black bands) in the lower spectrum area. This corresponds
for the harsh sound to a transfer of spectral energy toward
higher frequencies and the emergence of formants.
A detailed observation of the Gabor mask guides us
toward three salient acoustic features that might explain
the harshness phenomenon: an emergence of formantic
areas, an energy loss in the attack part, and fluctuations in
the temporal deployment of harmonic amplitudes. Regarding
the formantic feature, the repartition of black and white
bands clearly indicates a transfer of the spectral energy from
lower to higher partials, with a noticeable reinforcement
of the fifth component, the G]5 (330� 5¼ 1650 Hz). This
may indicate that the harsh E3 note would globally sound
brighter and in sympathetic resonance with its major third.
It is important to note that the emergence of formants might
be related to the physics resulting from an incorrect bowing
gesture (Demoucron, 2008; Guettler, 2003), suggesting a
slower bow speed for harsh sounds. Regarding the attack
feature, we can notice consistent black portions located at
the beginning of each harmonic band. This might correspond
to a global delay in the birth of harmonics composing the
harsh sound, globally implying a less marked attack than in
the round sound. Finally, regarding the harmonic features,
we observe amplitude fluctuations appearing in the temporal
deployment of each partial, which suggests some differences
FIG. 2. Gabor mask computed between the round and harsh signals of a cor-
pus pair for a regularization parameter k¼ 1. Their macroscopic differences
appear as deviations from m¼ 1 (corresponding to no transformation).
White frequency bands correspond to spectral energy reinforcements (mask
deviations greater than one) of the harsh sound compared to the round
sound, whereas black bands correspond to a decrease in spectral energy
(mask deviations lower than one).
2124 J. Acoust. Soc. Am. 141 (3), March 2017 Roz�e et al.
in the relative life cycle of partials between round and harsh
sounds.
D. From Gabor masks to harshness descriptors
The three features revealed by the Gabor mask analysis
suggest that the harshness is multidimensional by nature.
Further investigations of perceived and observed timbre
changes should be carried out by associating these observa-
tions with existing spectral, temporal, and spectro-temporal
timbre descriptors.
1. Spectral descriptors
From a spectral viewpoint, the main phenomenon
revealed by the Gabor mask sheds light on an energy increase
in the high frequencies of the harsh signal, and more precisely
an energy transfer from low to high frequencies with a for-
mantic reinforcement around 1650 Hz. This transfer may cor-
respond to an increase in the barycenter of the spectral energy
distribution, commonly characterized as the spectral centroid
(Peeters, 2004; Schoonderwaldt, 2009). Many studies refer to
this descriptor as perceptually associated with the brightness
(Grey and Gordon, 1978; McAdams, 1999; Caclin et al.,2005; Merer et al., 2007) and it has been validated as a reli-
able indicator of the timbre of bowed-string instruments like
the violin (Stepanek and Otcenasek, 2005; Stepanek, 2006;
Fritz et al., 2012). From our observations of the differences
between round and harsh sound signals, an increase in harsh-
ness might contribute to an increase in brightness.
Considering the harmonic nature of cello sounds, we
decided to compute the harmonic spectral descriptors. For
this purpose, a detection of the harmonic components and
the extraction of their instantaneous amplitudes were per-
formed beforehand. Knowing the fundamental frequency
of the note analyzed (E3), this process could be achieved
through decomposition in subbands centered on the signal
harmonics (Barthet et al., 2007). We could thus compute the
harmonic spectral centroid (HSC) of the sound, which corre-
sponds to the amplitude-weighted mean of its harmonic
peaks in the spectrum
HSC ¼ f0
XK
k¼1
kAk
XK
k¼1
Ak
(4)
where k and Ak are, respectively, the bin index and the ampli-
tude of the kth harmonic peak of the fundamental frequency
f0, and K is the total number of harmonics considered.
To characterize the spectral energy transfer occurring
within the harshness phenomenon, we computed the har-
monic tristimulus (Pollard and Jansson, 1982), which
describes the spectral energy distribution in three frequency
bands as the energy ratio between each band and the total
number of harmonics. The first band contains the fundamen-
tal frequency, the second one the medium partials (two,
three, four), and the last one the higher partials (five and
more). Three coordinates were hereby obtained, correspond-
ing to spectral centroid computations for each band
TR1 ¼A1XK
k¼1
Ak
; TR2 ¼
X4
k¼2
Ak
XK
k¼1
Ak
; TR3 ¼
XK
k¼5
Ak
XK
k¼1
Ak
; (5)
where Ak is the amplitude of the kth harmonic peak and K is
the total number of harmonics considered. From this classi-
cal definition of the tristimulus, we designed a new more
compact ratio focusing on the spectral transfer related to the
harshness phenomenon
TRIratio ¼ TR3
TR1 þ TR2
: (6)
The values of this spectral transfer descriptor are expected to
decrease for a round cello note and increase for a harsh one.
We characterized the emergence of formants revealed
by the Gabor mask using MFCCs (mel-frequency cepstral
coefficients; Davis and Mermelstein, 1980). This descriptor
stands for a simplified view of the spectral envelope with
just a few coefficients, and is therefore well suited to reveal
particular spectral envelope variations such as appearances
or shifts in formantic areas. It also takes into account the fre-
quency selectivity of our auditory system by using frequency
bands on a mel scale that are more selective for lower fre-
quencies than for higher ones (Kim et al., 2006; Bogert
et al., 1963). The MFCCs are linked to the notion of ceps-
trum, which has been intensively used to characterize the
formants of the voice because of its acoustic behavior of
source/filter type. A cello may also be modeled in this way,
if we consider the bow friction on a string as the excitation
source and the instrument-body as the filter. A harsh sound
would result from an incorrect excitation of the string by the
bow, which translates through the cello sound box into the
shift of a formant (an energy transfer) from low- to high-
frequency areas. MFCC coefficients ci are computed through
a DCT (discrete cosine transform) applied to the logarithmic
spectral envelope as follows:
ci ¼XNf
k¼1
log Ekð Þcos i k� 1
2
� �pNf
" #; 1 � i � Nc; (7)
where Ek is the spectral energy measured in the critical band
of the kth mel filter, Nf is the total number of mel filters, and
Nc�Nf is the number of MFCC coefficients to compute.
Standard recommendations were followed for MFCC
calculations (Kim et al., 2006): a mel-filter bank built with a
number of filters Nf¼ 24, and a bandwidth of Bm¼ 2700 mel
(limiting the frequency range to 8000 Hz). With a typical
number of MFCC coefficients Nc¼Nf /2¼ 12, we obtained
the smooth spectral envelopes of a round and a harsh signal,
presented in Fig. 3(a). The formantic structures of these two
envelopes look quite different, but can clearly be explained
by an examination of the first two MFCC coefficients.
Indeed, the MFCC coefficients C1 and C2 [see Fig. 3(b)] are
J. Acoust. Soc. Am. 141 (3), March 2017 Roz�e et al. 2125
the founding parameters that describe the global spectral
shape. C1 induces a low-pass behavior of the spectrum, while
C2 induces a bandpass effect. Note that C0 simply represents
the mean log-energy output of the filterbank, which is not of
interest in the present case when comparing spectral shapes.
The MFCC analysis thus suggests that the shape of the spec-
tral envelope evolves from a low-pass behavior for the round
sound (C1>C2), to a bandpass behavior for the harsh one
(C1<C2). This is coherent with the spectral transfer of har-
monic energies accounted for by the tristimulus criterion
[Eq. (5)].
In the same way as the TRIratio, we designed an MFCC
ratio suitable for describing the harsh formantic behavior
MFCCratio ¼ kC2kkC1k
: (8)
Given the behavior of the MFCC coefficients, the values of
this formantic descriptor should decrease for a round cello
note and increase for a harsh one.
2. Temporal descriptors
From a purely temporal viewpoint, the Gabor masks
shed light on a lack of energy at the beginning of the harsh
signal. Perceptual studies of Grey (1977) and McAdams
et al. (1995) have shown that the rise time of the energy of
the signal during the attack phase plays a prominent role in
instrument classification. From these statements, we deduced
that a suitable descriptor of the attack phase might help to
determine the timbre quality and thus contribute to the per-
ception of the harshness. A common temporal descriptor is
the log attack time (LAT) (Kim et al., 2006):
LAT ¼ log10 Tmax � Tthresholdð Þ; (9)
where Tmax is the time needed for the signal envelope to
reach its maximal value, and Tthreshold is the time needed for
the envelope to exceed 2% of its maximal value.
Furthermore, to propose a descriptor that is less
dependent on the energy differences between signals, we
FIG. 3. (a) Formantic envelopes of the round and harsh signals (the same sound pair as in Fig. 2). The vertical lines correspond to the center frequencies of 24
critical bands in the mel-filter bank, from which spectral envelope energies are computed. (b) The 12 first mean MFCC coefficients associated to the previous
round and harsh formantic envelopes. Absolute values of these coefficients are displayed to highlight the behavior difference between signals.
2126 J. Acoust. Soc. Am. 141 (3), March 2017 Roz�e et al.
computed an attack slope (ATS) or temporal increase
(Peeters, 2004), defined as the temporal average slope of the
energy during the attack phase
ATS ¼ AP
AT; (10)
where AP is the peak value of the signal envelope reached at
time Tmax, and AT¼ Tmax�Tthreshold is the attack time.
According to our observations, the values of this attack
descriptor should increase for a round cello note and
decrease for a harsh one.
3. Spectro-temporal descriptor
The third feature of the Gabor mask suggested the pres-
ence of energy fluctuations or asynchrony in the temporal
deployment of each harmonic amplitude. The instantaneous
amplitudes of harmonic components were thus extracted for
each round-harsh sound pair. In Fig. 4, we present this sub-
band decomposition for the same pair used in Gabor mask
analysis. Interestingly, it can be observed that the nature of
harmonic interactions seems quite different between the two
types of sounds. Indeed, the temporal harmonic structure
deploys itself regularly in the round signal, whereas it is much
more chaotic and disordered in the harsh one, with noticeable
reinforcements in energy of the fifth component around
120 ms and 220 ms after onset. This spectro-temporal phe-
nomenon might be efficiently captured by the HSV (harmonic
spectral variation), a descriptor related to the spectral flux
which focuses on harmonic components, and that has already
been used by Chudy et al. (2013) to assess timbre changes in
cello performances. The HSV is a descriptor related to the
time-varying spectral content, describing the spectral varia-
tion of harmonic amplitudes between adjacent frames. At the
frame level n, the HSV is defined as the complement to one
of the normalized correlation between the amplitudes of har-
monic peaks from two adjacent frames (Kim et al., 2006)
HSV ¼ 1
N
XN
n¼2
1�
XK
k¼1
Ak;n–1Ak;nð ÞffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiXK
k¼1
A2k;n–1
s ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiXK
k¼1
A2k;n
s8>>>>><>>>>>:
9>>>>>=>>>>>;; (11)
where Ak,n denotes the amplitude of the kth harmonic peak
in the nth frame. K and N are, respectively, the total number
of harmonics and frames considered. The values of this spec-
tral fluctuation descriptor should decrease for a round cello
note and increase for a harsh one.
E. Validating harshness descriptors
In order to assess the relevance of the six previously pre-
sented descriptors to characterize the harshness at the signal
level, we performed statistical tests on the sound corpus,
composed of round and harsh notes divided in two groups of
ten samples. Mean values and standard deviation were then
computed on the two groups for each descriptor. The results,
presented in Fig. 5 and Table I, indicate that all the descrip-
tors except the LAT are promising candidates for explaining
the harshness. The relevance of each signal descriptor was
evaluated by performing a simple paired two-tailed t-test,
based on the null hypothesis that the means of the two
groups are the same. We can also observe that, except for
the LAT and ATS, the descriptor values are much more
spread for harsh sounds than for round sounds. Regarding
the spectral descriptors (HSC, TRIratio, and MFCCratio),
this phenomenon suggests that harsh sounds would sound
more or less bright, according to spectral energy increases in
the upper or middle parts of the spectrum, respectively. By
contrast, round sounds may be perceived as “deep” and
“mellow,” because their spectral energies are better localized
on low-frequency components. In the same way, the spectro-
temporal descriptor values (HSV) that reveal a stronger
spectral spread within harsh sounds, suggests different levels
of asynchrony in the temporal evolution of transients. By
contrast, round sounds may be characterized overall by a
more systematic synchronicity between the harmonic transi-
ents. In the two next parts, we will try to assess the percep-
tion of these signal tendencies through two complementary
analysis methods.
III. BUILDING A PERCEPTUAL MODELOF HARSHNESS
From the validation of harshness descriptors, we con-
cluded on five acoustic descriptors that potentially could dis-
criminate round and harsh sounds from a signal point of
view. In this part, we propose an experimental protocol that
aims at identifying the most prominent signal descriptors
reflecting the variation of perceived harshness. This is based
on a perceptual evaluation of intermediate sounds between
round and harsh signals of a pair.
FIG. 4. Instantaneous amplitudes of
the first five harmonics for the (a)
round and (b) harsh sounds (sound pair
of Fig. 2).
J. Acoust. Soc. Am. 141 (3), March 2017 Roz�e et al. 2127
A. Participants
Fifteen experienced cellists took part in the experiment.
They were all volunteer teachers or students from musical
schools or the conservatory of Marseille. All of them had
extensive experience in cello playing, from 7 to 40 years of
practice (the youngest had teacher recommendations). None
of them had self-reported hearing problems.
B. Continua of stimuli
An intuitive way to assess an amount of perceived
harshness consists in comparing a continuum of gradually
harsher sounds with a reference sound judged to be of good
quality (round; Aramaki et al., 2009; Aramaki et al., 2011).
The ten sound pairs of the corpus were thus used as a base to
build continua of stimuli for this experiment. For each sound
pair, we carried out a morphing process to create a contin-
uum of four sounds gradually deteriorating into harshness.
This process first required a synchronization of the durations
of the round/harsh signals for each pair. It was achieved with
the Adobe Audition software, by using the same time align-
ment technique as for Gabor mask construction.
The morphing process between pairs of round and harsh
sounds was possible thanks to the properties of the Gabor
multipliers (Olivero et al., 2010). Indeed, from Eq. (2), we
define Mm, the Gabor multiplier of mask m, which morphs
into a target signal x2 by a pointwise multiplication of its
mask with the Gabor transform coefficients of the source
signal x1,
x2 ’Mmx1 ¼Xm;n
mcm;n x1ð Þhm;n; (12)
where cm,n(x1) is the Gabor transform of the source signal x1,
and hm,n the dual synthesis window of gm,n [Eq. (1)].
To create the four sounds composing the continuum of a
given pair, we provided the multiplier equation [Eq. (12)]
with the round sound as the source signal x1, and the harsh
one as the target signal x2. Intermediate sounds of these
two signals were generated by computing several Gabor
masks with gradually decreasing values of the regularization
parameter k [Eq. (3)]. The steps selected for k have been
chosen by the authors through an informal listening test to
produce the most continuous evolution in harshness percep-
tion. For each sound pair, two Gabor masks corresponding to
two gradual intermediary k values were computed, and the
two associated sounds resynthesized with the multiplier for-
mula. The four sounds composing each continuum were
finally equalized in loudness with the Loudness toolbox
(Genesis, 2009). The whole procedure performed on each of
the ten sound pairs resulted in a sound corpus composed of
ten continua of four stimuli (cf. Roz�e, 2017, for examples of
continua).
FIG. 5. (Color online) Comparison of
spectral, temporal, and spectro-temporal
features between the two groups of
round and harsh sounds. The central
marks are the medians, the edges of the
boxes are the 25th and 75th percentiles.
TABLE I. Results of paired t-tests on the six defined acoustical descriptors.
The relevance of discriminating the round and harsh groups of ten sounds
for each descriptor is given by the p-value p: *p< 0.05, **p< 0.01,***p< 0.001.
Descriptors HSC TRIratio MFCCratio LAT ATS HSV
t(9) ¼ 5.69*** 4.1** 4.13** 1.21 �6.17*** 5.16***
2128 J. Acoust. Soc. Am. 141 (3), March 2017 Roz�e et al.
Note that the morphing process that transforms the
round sound into a harsh one is a monotonously decreasing
mathematical transformation, which not necessarily reflects
a linear perception of harshness. For this reason, in some
continua, it was slightly more difficult to ensure the per-
ceived continuity between the second synthetic stimulus and
the harsh target sound x2. The subject’s scores observed in
the following MUSHRA (multi stimulus test with hidden ref-
erence and anchor)-based listening test [Fig. 6(a)] confirmed
this perceptual tendency.
C. MUSHRA-based procedure
The listening test of this experiment was designed with
the MATLAB software, and installed on a laptop with
Sennheiser HD-650 headphones (Wedemark, Germany). We
performed it in the laboratory or in quiet places depending on
the cellists’ availabilities. Its conception was inspired from
the MUSHRA comparison method (ITU-R Recommendation
BS.1534-1, 2003), commonly used for audio codec quality
assessment. The MUSHRA design relies on multiple compari-
sons of items with respect to a given reference. This reference
is hidden among the set of items in order to potentially elimi-
nate subjects who cannot recognize it.
We designed an interface based on these principles
(cf. Roz�e, 2017), but suited to our perceptual comparisons
of harshness. Each subject was randomly presented with
ten successive trials corresponding to the ten continua of
four stimuli to compare. Within each continuum, the presen-
tation order of the four sounds was randomized. The round
sound was chosen as the hidden reference of the MUSHRA
test. Subjects were asked to rate the relative amount of per-
ceived harshness between this reference and each of the four
sounds of the continuum according to a continuous quality
scale (CQS). This CQS consists of identical graphical scales
divided into five equal intervals gradually composed from
the bottom to the top scale of the adjectives “barely more
harsh” (a peine plus in French), “a little more harsh” (un peuplus), “more harsh” (plus), “considerably more harsh” (bienplus), and “much more harsh” (beaucoup plus). Subjects
FIG. 6. Observed vs predicted scores of the least-square multiple linear regression model (a) by continuum (observed with the subjects’ standard deviations in
grey solid line, predicted in black dotted line) and (b) for all the continua (R2¼ 0.724, n¼ 40). (c) Normal probability plot of residuals. (d) Raw dispersion of
residuals.
J. Acoust. Soc. Am. 141 (3), March 2017 Roz�e et al. 2129
could rate 0 for a sound on this scale if they did not perceive
a difference in harshness with respect to the reference, or
even judge it identical. They could listen to the sounds as
many times as desired, but it was not possible for them to
return to previous continua once their choice was validated.
D. Data analysis
To investigate the relationships between the multidi-
mensional set of acoustic features and the perceptual scores
of the subjects, we carried out a multiple linear regression
analysis. Two regression equations containing the most sig-
nificant signal predictors were inferred. We here describe the
different data preparation steps for the regression process.
1. Score checking
Subjects’ ratings were most often zero for the sound cor-
responding to the hidden reference, although a small amount
of harshness was sometimes attributed for this reference. We
checked that the score of the round reference was always the
lowest within each continuum. Consequently, we did not
reject any continuum or subject.
2. Dataset building
For the 40 sounds composing the experimental plane,
we computed the 5 acoustical descriptors from Sec. II E. In
addition, the distribution of the four perceptual scores within
each continuum was centered and normalized (mean 0, stan-
dard deviation 1) to compare the ten continua independently
of their respective references. These perceptual scores were
averaged on all the subjects for each sound.
3. Multiple linear regression
We performed a multiple linear regression analysis
on the dataset in order to find a relevant model of signal
descriptors which might explain the perceptual harshness
scores (40 observations). This regression relies on typical
least squares fitting, a process consisting in searching the fit-
ting coefficients bp, which minimize the mean square differ-
ence �2 between the model bY (or prediction vector Xpbp) and
the response vector Y,
Y ¼ bY þ � ¼ Xpbp þ �; (13)
where p¼ 5 independent variables Xp of signal descriptors
computed on the 40 sounds. To certify the robustness of
this basic model, we checked the distribution of its residual
vector �, and validated the assumptions of linearity and
homoscedasticity.
As a complementary approach, we also performed a
coarser, stepwise regression to identify the most prominent
descriptors of the basic predictive model. This stepwise
regression has the advantage of automatically detecting mul-
ticollinearities among the predictors. It is here used as an
additional approach to keep the less redundant descriptors
that are likely to explain the perceived harshness scores.
E. Results
The results of the least-square multiple linear regression
method are presented in Table II. They suggest that three out
of the five regression coefficients are statistically different
from zero (significance level 5%): The spectro-temporal
descriptor HSV is predominant in the model (p< 0.001),
the temporal descriptor ATS also seems to be relevant
(p< 0.01), and the MFCCratio appears as the most relevant
among the purely spectral descriptors (p< 0.05).
Figure 6 describes the fitting between the predictive
model and the observed scores. In Fig. 6(a), we notice that
the subjects actually perceived a global increase in amount
of harshness within most sound continua. This perception
was not fully linear, since they often rated the second syn-
thetic stimuli (s3) as harsher than the real harsh sound (s4).
This can be explained as a consequence of the morphing pro-
cess that does not directly reflect human perception. In the
same way, most of the predictions within each continuum
follow an increasing tendency, which reveals that it was
possible to find a linear combination of acoustic descriptors
that fitted quite well the observed scores of harshness.
This tendency is confirmed in Fig. 6(b) with a rather signifi-
cant determination coefficient (R2¼ 0.724 and adjusted R2
¼ 0.683). Similarly, the statistical distribution of the resid-
uals [Fig. 6(c)] also tends to confirm that linearity and homo-
scedasticity assumptions are verified. Indeed, the linearity
hypothesis appears in the normal probability plot of resid-
uals, and homoscedasticity in the regular scattering of resid-
uals on both sides of the axis y¼ 0. A Lilliefors test
confirmed that the residual vector � exhibits a normal distri-
bution (p> 0.5).
From the process of least-square regression, we thus
obtain a basic predictive model of perceived harshness. This
can be written as an equation predicting the cellists’ scores
from the three best fitting descriptors
Ybasic ¼ 0:29MFCCratio� 0:36ATSþ 0:49HSV: (14)
The stepwise regression approach provides a comple-
mentary model of lesser significance (R2¼ 0.684 and
adjusted R2¼ 0.667), retaining only two prominent descrip-
tors: the HSV (p< 0.001) and the ATS (p< 0.001),
Ystepwise ¼ 0:59HSV� 0:38ATS (15)
F. Discussion
The global equation [Eq. (14)] suggests that the sensation
of perceived harshness is multidimensional and characterized
by a complex transformation of timbre involving spectral,
temporal, and spectro-temporal aspects. Interestingly, we can
TABLE II. Results of the least-square multiple linear regression. The p-
value p of each descriptor gives the relevance to add its coefficient bp to the
model: *p< 0.05, **p< 0.01, ***p< 0.001.
Descriptors HSC TRIratio MFCC ratio ATS HSV
Coefficients bp 0.45 �0.58 0.29* �0.36** 0.49***
2130 J. Acoust. Soc. Am. 141 (3), March 2017 Roz�e et al.
notice the negative coefficient of ATS, while positive coeffi-
cients are obtained for HSV and MFCCratio. This observation
reveals an anti-correlation between temporal and spectral fea-
tures, already observed in Sec. II D: An increase in perceived
harshness induces a formantic emergence and more harmonic
asynchrony, associated to a decrease in the temporal ATS.
This is consistent with the Helmholtz motion, which builds
more slowly or is more unstable. Furthermore, the stepwise
regression eliminates the MFCCratio as a harshness predictor,
which suggests a certain collinearity between the MFCCratio
and the HSV descriptors. This dependence is confirmed by
the correlation coefficient between the two descriptors (r2
¼ 0.65***), which can be explained from signal consider-
ations, since the emergence of a formant corresponds to a sud-
den growth of upper partials within the harmonic’s life cycle
[see the evolution of the fifth component in Fig. 4(b) for an
example]. This implies more crossings between the temporal
evolutions of harmonic amplitudes, which should result in a
stronger harmonic disorder as an indirect consequence.
Gabor multipliers turned out to be a quite efficient tool
to create continua with increasingly harsher sounds. The
degradations of timbre produced by this technique led to a
model of perceived harshness. However, this approach
presents some limitations since it proposes a statistical con-
struction relying on the variations of a single feature (the
regularization parameter k) which has no significance in
terms of acoustic descriptors. This is not sufficient to evalu-
ate the individual perceptual relevance of each descriptor.
Section IV aims at digging deeper into these aspects.
IV. TOWARD A CONTROL OF PERCEIVEDHARSHNESS
In this section, we are going to assess to which extent
each descriptor brought to light by the predictive model con-
tribute to the perceived harshness. This procedure is essential
as a perceptual validation of each descriptor. For this pur-
pose, a synthesis process is used to manipulate/control the
temporal and spectral signal envelopes. This control will fur-
ther be perceptually validated by a paired comparison test,
resulting in the elaboration of a perceptual harshness space.
A. Choice of control descriptors
The basic predictive model [Eq. (14)] of perceived
harshness suggests that both spectral and temporal features
are involved in this phenomenon. Among the three main
descriptors (HSV, ATS, and MFCCratio) characterizing the
harshness, the HSV was found to be predominant. However,
this descriptor is not suitable for resynthesizing and control-
ling the harshness, since it encompasses too many details in
the signal. Instead, we might take advantage of the spectro-
temporal duality of the other descriptors unveiled in the pre-
vious experiment to attempt to control the harshness through
only a few parameters: the coefficients composing the
MFCCratio, which can be considered as a shaping parameter
of the spectral sound envelope (and which is correlated with
the HSV), and the ATS, defining the attack slope, likely to
be considered as a shaping parameter of the temporal attack
envelope.
The validity of this assumption can be tested by carrying
out a partial bivariate least-square linear regression on the
purely spectral MFCCratio and purely temporal ATS predic-
tor. This leads to a less relevant predictive model than Eq.
(14) (R2¼ 0.595 and adjusted R2¼ 0.573), but with satisfy-
ing p-value predictors: MFCCratio (p< 0.001) and ATS
(p< 0.001). We therefore decided to use these descriptors to
control the perceived harshness.
B. Participants
The harshness, produced by this synthesis method, was
assessed through a listening test based on dissimilarity rat-
ings of sound pairs. The participants who took part in this
listening test were the same 15 experienced cellists as in
Sec. III A. Each of the two tests lasted for around 15–20 min
for each subject. The tests were separated by a short pause.
The total duration of these two listening tests was thus
around 45 min per subject.
C. Stimuli
1. Cross-synthesis
A round/harsh representative sound pair was chosen by
informal listening tests from the ten paired samples compos-
ing the sound corpus. For each signal of this pair, we used
the technique of subband decomposition of the harmonics to
design a synthetic calibrated sound composed of 25 harmon-
ics. This number of harmonics was necessary to obtain a per-
ceptually convincing cello sound.
From this root synthetic pair (a round and a harsh sound)
called S1/S8, we created six intermediary synthetic stimuli
using a cross-synthesis technique performed in the spectral
and temporal domains. This technique requires the parame-
terization of spectral and temporal envelopes by means of a
basis function expansion: env ¼PK
k¼1 Ck/k, where Ck are
the coefficients of the expansion, and /k the set of basis
functions to be combined linearly. This parameterization
turns out to be a very flexible way to control and adjust the
shape of the envelopes since it enables us to design enve-
lopes, presenting intermediate spectral or temporal features
of round and harsh sounds.
To apply intermediate envelopes to the source sound, a
whitening process consisting in dividing the source signal
by its own envelope (env1) was first effectuated. The result-
ing (whitened) signal was then multiplied by the intermedi-
ate envelope (env2). This cross-synthesis technique thus
consisted in filtering the source sound by the transfer func-
tion env2/env1. To avoid divergence effects due to potential
divisions by zero of the source envelope env1, we turned
the division of the transfer function into a difference of
logarithmic envelopes (Z€olzer, 2008): log jenv2j=jenv1jð Þ¼ log jenv2j � log jenv1j. By applying the exponential oper-
ator to this logarithmic difference, the cross-synthesis pro-
cess of envelopes could be achieved accordingly.
a. Spectral cross-synthesis. In the spectral domain, we
parameterized a formantic envelope as an expansion of the
mel-cepstrum coefficients Ck, with the discrete cosines
J. Acoust. Soc. Am. 141 (3), March 2017 Roz�e et al. 2131
functions /k as a basis. Indeed, according to Eq. (7), we can
rebuild a logarithmic envelope Espec of the spectrum signal
magnitude jX fð Þj from a vector of MFCC coefficients by
means of an inverse DCT (Poli and Prandoni, 1997),
Espec mel fð Þð Þ ¼XNc
k¼1
Ck cos 2pkmel
Bm
� �; 0 � mel � Bm
’ log jX fð Þj; (16)
where Nc is the subset of coefficients chosen for resynthesiz-
ing the spectrum envelope. The bandwidth of the Nf¼ 24
mel-filter bank already used for computations of MFCCs is
Bm¼ 2700 mel (Sec. II D). Espec represents a smoothed ver-
sion of the real spectral envelope, whose precision depends
on the number of coefficients used for the reconstruction (we
chose Nc¼Nf /2¼ 12). From the MFCC vector modeling the
formantic envelope, we then obtained intermediate spectral
envelopes between sounds S1 and S8 by tuning the coeffi-
cients C1 and C2 composing the MFCCratio.
b. Temporal cross-synthesis. In the time domain, we
parameterized a temporal envelope as an expansion of
B-spline basis functions /k weighted by coefficients C0k. It
approximated the temporal signal e(t) by an envelope Etemp
composed of the sum of a small set of scaled splines
(Ramsay, 2006)
Etemp tð Þ ¼XmþL�1
k¼1
C0kBk t; sð Þ ’ e tð Þ; (17)
where Bk(t,s) is the value at t of the B-spline basis function
defined by an order m and a knot sequence s of length L – 1.
The number of splines was defined by the order plus the
number of interior knots mþL – 1. A basis of five piecewise
cubic splines (order m¼ 3) was chosen to parameterize the
temporal envelopes with quite a good fit. From there, we
could design some intermediate temporal envelopes between
sounds S1 and S8, by specifically tuning the coefficients C03and C04 of their five-spline vectors. Indeed, this manipulation
made it possible to move the peak attack amplitude forward
or backward, which intrinsically corresponded to controlling
the ATS.
2. Eight synthetic sounds
Figure 7 gives a synopsis of this stimuli design struc-
tured into two clusters, a round one and a harsh one. Each
cluster is composed of the initial root sound (S1 or S8,
respectively), and three synthetic sounds designed by cross-
synthesis from the initial stimuli (cf. Roz�e, 2017).
In the spectral domain, we created two sounds in each
cluster. For the round cluster created from the round sound
S1, the formantic envelope of the sound S3 was designed by
decreasing C1 and increasing C2. Symmetrically, for the
harsh cluster created from the harsh sound S8, the formantic
envelope of the sound S6 was designed by increasing C1 and
decreasing C2. The final values of the MFCC coefficients C1
and C2 hereby defined the intermediate spectral envelopes of
S1 and S8. The upper part of Fig. 7 presents the shapes of
the formantic envelopes for these four sounds:
cS3 ¼ cS1 �ERH
spec
ERspec
; cS6 ¼ cS8 �EHR
spec
EHspec
;
where bSi represents the spectrum of the sound Si.
In the temporal domain, we also created two sounds in
each cluster. For the cluster round created from the round
sound S1, the sound S2 with a lower ATS was designed
by decreasing C03 and increasing C04. Symmetrically, for the
cluster harsh created from the harsh sound S8, the sound S7,
with a higher ATS, was designed by increasing C03 and
FIG. 7. Design by cross-synthesis of the eight synthetic stimuli. (Top) The
two original spectral envelopes of S1 (ERspec) and S8 (EH
spec) with their respec-
tive intermediaries (ERHspec and EHR
spec). (Middle) Cluster of the four synthetic
sounds built from the round sound S1 (mid-left) and from the harsh sound
S8 (mid-right). (Bottom) The two original temporal envelopes of S1 (ERtemp)
and S8 (EHtemp) with their respective intermediaries (ERH
temp and EHRtemp).
2132 J. Acoust. Soc. Am. 141 (3), March 2017 Roz�e et al.
decreasing C04. The final values of the spline coefficients C03and C04 hereby defined the intermediate temporal envelopes
of S1 and S8. The bottom part of Fig. 7 presents the shapes
of the temporal envelopes for these four sounds:
S2 ¼ S1�ERH
temp
ERtemp
; S7 ¼ S8�EHR
temp
EHtemp
:
The two last synthetic stimuli were created by hybrid
crossings of the spectral and temporal envelopes in each
cluster. For the cluster round, the sound S4 was designed by
applying the temporal envelope of S2 to the sound S3.
Symmetrically, for the cluster harsh, the sound S5 was
designed by applying the temporal envelope of S7 to the
sound S6:
S4 ¼ S3�ERH
temp
Etemp S3ð Þ ; S5 ¼ S6�EHR
temp
Etemp S6ð Þ;
where Etemp(Si) represents the temporal envelope of the
sound Si.
Finally, we equalized the loudness of the eight synthetic
stimuli by the means of the Loudness toolbox (Genesis,
2009) to ensure timbre comparisons with the same sound
level.
D. Paired-comparison procedure
The listening test of this experiment was performed sub-
sequent to the previous one (Sec. III C) with the same equip-
ment and in the same conditions. We designed an interface
(cf. Roz�e, 2017) to evaluate the amount of perceived harsh-
ness between pairs of the eight synthetic stimuli. The combi-
nations of pairs were presented in both ways (direct AB and
reverse BA), and in a global random way for each subject to
avoid any effect of presentation order. Since identical pairs
were not presented, a total of 56 pairs (8� 7) was evaluated.
For each sound pair presented, the subject was asked to
process two tasks: first, choose the sound of the pair A/B per-
ceived as the most harsh. It was possible to check a third
item entitled “none” if the two sounds were considered equal
in terms of harshness. Then, assess the difference of per-
ceived harshness of the checked sound relatively to the other
one, according to a CQS, corresponding to a graphic scale
divided into three equal intervals, i.e., “a little bit more
harsh,” “more harsh,” and “much more harsh.” The subjects
were informed that they could rate zero on this scale, if they
judged the two sounds of a pair identical in terms of harsh-
ness, even if they were structurally different. In this case,
they had to check the “none” item. Conversely, the harshness
rating was only available when a non-zero rating was chosen
on the scale. They could listen to the sounds of each pair as
many times as they desired, but it was not possible for them
to return to previous pairs once their choice was validated.
E. Data analysis
To explore the repartition of the subjects’ ratings, we
performed a metric MDS analysis. We provided the MDS
with a dissimilarity matrix composed of the harshness pro-
portion rated by the subjects between pairs of synthetic stim-
uli. This dissimilarity matrix was designed as a full non-
symmetric 8� 8 table of 56 harshness ratings, including the
2 ways of presenting the stimuli (direct AB in the upper-
triangular and reverse BA in the lower-triangular). Since
only symmetrical matrices can be used in MDS analyses, we
averaged the dissimilarity ratings of each symmetric element
and stored this average in an upper-triangular matrix of 28
(56/2) distances. This process was repeated for each partici-
pant, and finally we obtained a global dissimilarity matrix by
averaging the individual matrices of all subjects. The quality
of the obtained MDS configuration was evaluated by taking
into account two criteria: a scree-elbow criterion of the
Kruskal stress evolution (for the choice of the right number
of perceptual dimensions in the new space), and an analysis
of the Shepard diagram (for the preservation of the distances
in the new space relative to initial dissimilarities). Finally,
each dimension of the MDS space was correlated with the
salient signal descriptors of harshness previously obtained.
F. Results and discussion
The goodness-of-fit criterion, corresponding to a mini-
mization of the Kruskal stress, indicated that three dimen-
sions were sufficient to get an optimal MDS configuration
(stress � 0.093). Figure 8 presents the tridimensional space
of the eight synthetic stimuli obtained by the MDS, with its
two main bi-dimensional projections. This system depicts a
perceptual harshness space. Table III reports the correlations
between each MDS dimension and the five acoustic descrip-
tors highlighted in Sec. II E.
On the first spatial dimension, a very strong correlation
can be observed with the HSVs descriptor, and to a lesser
extent with the temporal ATS. The high HSV score suggests
that the perceived harshness essentially increases with a loss
of synchronism between the harmonics of the signal. The
ATS score, weaker and negatively correlated to the first
dimension, reveals a concurrent loss of ATS in the temporal
signal envelope. The predominance of HSV in these results
is a bit surprising, given that it was not directly used to cre-
ate the stimuli. However, these variations in harmonic cha-
otic behavior actually emerged as an indirect consequence of
synthesis transformations on the formantic envelopes.
Interestingly, the first dimension of Fig. 8(a) splits the
three-dimensional (3D) space into two groups matching the
clustering design of the synthetic stimuli: At the left, we
obtain the round cluster (S1, S2, S3, S4) and at the right the
harsh one (S8, S7, S6, S5). Besides, the pairs of stimuli that
only differ by their ATS (S1/S2, S3/S4, S5/S6, and S7/S8)
rate quite similarly on this axis. Therefore, in the round
cluster, the intermediary pair of stimuli S3/S4 was perceived
as harsher than the S1/S2 pair because of a stronger har-
monic asynchrony coupled with a smallest temporal ATS.
Symmetrically, in the harsh cluster, the intermediary pair of
stimuli S5/S6 was perceived as rounder than the S7/S8 pair
because of less harmonic asynchrony coupled with a greatest
temporal ATS. These results validate the physical relevance
of the HSV and ATS descriptors in the predictive harshness
J. Acoust. Soc. Am. 141 (3), March 2017 Roz�e et al. 2133
model [Eq. (14)]. It is also consistent with the stepwise com-
plementary approach of the previous experiment [Eq. (15)],
and confirms the central role of the simultaneous opposite
variations of HSV/ATS in harshness perception.
Referring back to Table III, the second spatial dimen-
sion correlates quite well with all the purely spectral descrip-
tors HSC, TRIratio, and MFCCratio. The strong correlation
with the MFCCratio validates the physical coherence of this
descriptor in the predictive model of harshness [Eq. (14)].
Besides, the HSC and TRIratio descriptors arise as corollar-
ies from modifications of the spectral shape, since the emer-
gence of a formant necessarily implies a transfer of spectral
energy. In Fig. 8(b), the projection of the MDS configuration
on the first two dimensions shows that the stimuli do not fol-
low the cluster design any longer. Certain sounds synthe-
sized from intermediate formantic envelopes were actually
rated outside the bounds of the two root sounds (S1 and S8).
On the second dimension, both stimuli S3/S4, formantically
harsher than S1, seem hereafter to be perceived with more
degraded quality than the root harsh sound S8. And symmet-
rically, both stimuli S5/S6, formantically less harsh than S8,
now seem to be perceived with better timbral quality than
the round root sound S1. This result turns out to comply with
the evolution of the MFCCratio computed from intermediate
values of the first two Mel cepstrum coefficients (C1 and
C2). Therefore, the second dimension should be interpreted
with respect to the spectral energy distribution, and tends to
characterize the harshness phenomenon in terms of per-
ceived brightness.
Finally, Table III does not reveal at first sight any acous-
tic correlates of the third spatial dimension. However, the
projection of the MDS configuration on the dimensions one
and three of the perceptual space [Fig. 8(c)] reveals a sur-
prising organization of the stimuli. Indeed, they are distrib-
uted on either side of the third axis, according to pairs only
differing by the ATS of their temporal envelope. With the
exception of the pair S5/S6 perceived a little bit confusedly,
the sounds composed of a harsh temporal envelope are nega-
tively rated (S2, S4, S8), whereas those composed of a round
temporal envelope are positively rated (S1, S3, S7). This
third dimension thus reflects that subjects only detected
changes in temporal ATS, without significant consequences
on their perception of harshness.
V. CONCLUSION
In this paper, we explored a deterioration of the timbre
quality for cello sounds, often characterized as sound
FIG. 8. (Color online) (a) Perceptual space resulting from the metric MDS achieved in three dimensions (Kruskal stress¼ 0.093). The spatial configuration is
projected on the two main pairs of dimensions: (b) Dim1 vs Dim2, (c) Dim1 vs Dim3.
TABLE III. Correlations between the salient signal descriptors of harshness
and the coordinates of the eight stimuli along the three dimensions of the
MDS space, p-values: *p< 0.05, **p< 0.01, ***p< 0.001. Significant values
appear in boldface.
Acoustic correlates Dim 1 Dim 2 Dim 3
HSC 0.52 0.84** 0
TRIratio 0.58 0.79* �0.01
MFCCratio 0.04 0.84** 0.1
ATS �0.79* �0.37 0.32
HSV 0.98*** 0.02 �0.02
2134 J. Acoust. Soc. Am. 141 (3), March 2017 Roz�e et al.
harshness among professional cellists. By using analysis-
synthesis techniques and suitable listening tests, this percep-
tually undesirable feature was modeled into a combination
of some relevant signal descriptors. First, a signal-based
approach using Gabor masks (Sec. II) revealed that a harsh
sound may essentially be characterized by three facets: an
energy transfer or formantic shift toward higher order parti-
als of its spectrum, a weakened ATS of its temporal enve-
lope, and more fluctuations in the harmonics’ life cycle.
Then, two complementary experiments enabled us to percep-
tually assess the effect of these signal features. On the one
hand, a least-square linear regression (Sec. III) performed on
the subjective ratings of continuously harsher sounds,
allowed us to build a predictive model of the harshness. This
model shed light on the spectro-temporal duality potentially
responsible for the harsh perception. On the other hand, a
MDS was carried out (Sec. IV), in order to perceptually vali-
date this signal duality, and better assess the harshness con-
tribution of each identified descriptor. To this aim, we
attempted to create rounder and harsher synthetic stimuli
simply by controlling their spectral and temporal envelope
shapes. The MDS, performed on the perceived dissimilarities
of pairs of these synthetic stimuli, were conclusive. As a
matter of fact, the method gave rise to a perceptual harshness
space, which two main dimensions were strongly correlated
to the acoustic descriptors identified by the predictive model.
The perceived harshness was thus partly due to the combina-
tion of a more chaotic harmonic behavior and a weaker
ATS, which corresponds to a more unstable Helmholtz
motion, building up more slowly. The shift in formant and
energy transfer toward the upper partials of the spectrum
appeared as corollaries of the loss in synchronicity between
signal transients.
It should be noted that the perceptual space created by
investigating the harshness phenomenon presents some inter-
esting analogies with the recognized timbre spaces in the lit-
terature (Grey, 1977; McAdams et al., 1995; Barthet et al.,2010). As a matter of fact, we can roughly consider by anal-
ogy with the terms of McAdams et al. (1995), our first
dimension as the “spectral fluctuations,” characterizing the
variations of synchronicity in the transients of harmonics.
Our second dimension may match with the “spectral energy
distribution,” and the third one may be interpreted in terms
of “temporal patterns” (even though this dimension is not
itself directly linked to harshness perception). We expect
similar trends for notes that differ from the one investigated
in this analysis, at least with respect to the signal descriptors.
Nevertheless, further investigations are needed to verify the
robustness of the perceived harshness phenomenon on other
parts of the score.
The results of this study naturally encourage the pursuit
of additional investigations of the gestures responsible for
this sound degradation. In particular, it seems important to
assess the coherency with physics-based synthesis models
(Demoucron, 2008), regarding the relationships between the
acoustical features composing the harshness and the control
gestures of the musician. Particularly, the transfer of spectral
energy responsible for the increase in brightness is likely to
coincide with a stronger bow force or a slower bow velocity
(Guettler et al., 2003; Guettler, 2003). Similarly, the weak
ATS is believed to match with an improper Helmholtz
motion, caused by a lack of firmness in the onset of the note
produced by the string–bow interaction. In addition to the
changes directly related to the physical parameters of the
bow, it could be very interesting to investigate how the pos-
tural constraint impacted the gestural coordination and con-
sequently the timbre of the sound. For example, we may ask
whether the cellists compensated their postural constraint by
consistently adjusting the joint movements of their right
arm, and if these specific adjustments resulted in the harsh
sounds that we collected for this study. If this is the case, it
might indicate that postural movements of the chest and the
head play an important role in motor coordination patterns
of the right arm’s joints, and indirectly in the maintenance of
a good (round) timbre quality.
A potential application of these results consists in
designing new technological tools that enable real-time mea-
sures of harshness. This technology-enhanced learning
would be of great interest for young cello students that are
not always aware of how their gestures can be optimized in
order to improve their sound quality. Inappropriate gestures
might hereby be corrected by the feedback of such devices.
ACKNOWLEDGMENTS
This work is partly supported by the French National
Research Agency and is part of the “Sonimove” project
(Grant No. ANR-14-CE24-0018). We would like to thank
the reviewers for their constructive comments related to the
manuscript.
Aramaki, M., Besson, M., Kronland-Martinet, R., and Ystad, S. (2011).
“Controlling the perceived material in an impact sound synthesizer,” IEEE
Trans. Audio, Speech, Lang. Process. 19(2), 301–314.
Aramaki, M., Gondre, C., Kronland-Martinet, R., Voinier, T., and Ystad, S.
(2009). “Thinking the sounds: An intuitive control of an impact sound syn-
thesizer,” in International Conference on Auditory Display (ICAD’09), pp.
119–124.
Barthet, M., Guillemain, P., Kronland-Martinet, R., and Ystad, S. (2010).
“From clarinet control to timbre perception,” Acta Acust. Acust. 96(4),
678–689.
Barthet, M., Kronland-Martinet, R., and Ystad, S. (2008). “Improving musi-
cal expressiveness by time-varying brightness shaping,” in ComputerMusic Modeling and Retrieval. Sense of Sounds (Springer-Verlag Berlin-
Heidelberg), pp. 313–336.
Barthet, M., Kronland-Martinet, R., Ystad, S., and Depalle, P. (2007). “The
effect of timbre in clarinet interpretation,” in International ComputerMusic Conference (ICMC) (Copenhagen, Denmark), pp. 59–62.
Bogert, B. P., Healy, M. J., and Tukey, J. W. (1963). “The quefrency alany-
sis of time series for echoes: Cepstrum, pseudo-autocovariance, cross-
cepstrum and saphe cracking,” in Proceedings of the Symposium on TimeSeries Analysis, Vol. 15, pp. 209–243.
Caclin, A., McAdams, S., Smith, B. K., and Winsberg, S. (2005). “Acoustic
correlates of timbre space dimensions: A confirmatory study using syn-
thetic tones,” J. Acoust. Soc. Am. 118(1), 471–482.
Chadefaux, D., Le Carrou, J.-L., Wanderley, M. M., Fabre, B., and Daudet,
L. (2013). “Gestural strategies in the harp performance,” Acta Acust.
Acust. 99(6), 986–996.
Chudy, M., P�erez Carrillo, A., and Dixon, S. (2013). “On the relation
between gesture, tone production and perception in classical cello
performance,” in Proceedings of Meetings on Acoustics ICA2013, Vol. 19,
No. 1, p. 035017.
Davis, S., and Mermelstein, P. (1980). “Experiments in syllable-based rec-
ognition of continuous speech,” IEEE Trans. Acoust., Speech Signal
Process. 28, 357–366.
J. Acoust. Soc. Am. 141 (3), March 2017 Roz�e et al. 2135
Demoucron, M. (2008). “On the control of virtual violins—Physical model-
ling and control of bowed string instruments,” Ph.D. thesis, Universit�ePierre et Marie Curie-Paris VI; Royal Institute of Technology, Stockholm.
Depalle, P., Kronland-Martinet, R., and Torr�esani, B. (2006). “Time-fre-
quency multipliers for sound synthesis,” in Proc. the Wavelet XIIConference (SPIE Annual Symposium, San Diego, CA), pp. 221–224.
Desmet, F., Nijs, L., Demey, M., Lesaffre, M., Martens, J.-P., and Leman,
M. (2012). “Assessing a clarinet player’s performer gestures in relation to
locally intended musical targets,” J. New Music Res. 41(1), 31–48.
Fritz, C., Blackwell, A. F., Cross, I., Woodhouse, J., and Moore, B. C.
(2012). “Exploring violin sound quality: Investigating English timbre
descriptors and correlating resynthesized acoustical modifications with
perceptual properties,” J. Acoust. Soc. Am. 131(1), 783–794.
Genesis (2009). “Loudness toolbox,” available at http://genesis-acoustics.-
com/en/loudness_online-32.html (Last viewed January 3, 2017).
Grey, J. M. (1977). “Multidimensional perceptual scaling of musical
timbres,” J. Acoust. Soc. Am. 61(5), 1270–1277.
Grey, J. M., and Gordon, J. W. (1978). “Perceptual effects of spectral modi-
fications on musical timbres,” J. Acoust. Soc. Am. 63(5), 1493–1500.
Guettler, K. (2002). “On the creation of the Helmholtz motion in bowed
strings,” Acta Acust. Acust. 88(6), 970–985.
Guettler, K. (2003). “A closer look at the string player’s bowing gestures,”
CASJ (Series II) 4(7), 12–16.
Guettler, K., Schoonderwaldt, E., and Askenfelt, A. (2003). “Bow speed or
bowing position—Which one influence spectrum the most?,” in
Proceedings of the Stockholm Music Acoustic Conference (SMAC).ITU-R Recommendation, B. S. 1534-1 (2003). “Method for the subjective
assessment of intermediate quality levels of coding systems (MUSHRA),”
Technical Report, International Telecommunication Union, Geneva,
Switzerland.
Kim, H.-G., Moreau, N., and Sikora, T. (2006). MPEG-7 Audio andBeyond: Audio Content Indexing and Retrieval (John Wiley & Sons,
Hoboken, NJ).
Lartillot, O., and Toiviainen, P. (2007). “A MATLAB toolbox for musical fea-
ture extraction from audio,” in International Conference on Digital AudioEffects, pp. 237–244.
Leman, M. (2008). Embodied Music Cognition and Mediation Technology(MIT Press, Cambridge, MA).
McAdams, S. (1999). “Perspectives on the contribution of timbre to musical
structure,” Comput. Music J. 23(3), 85–102.
McAdams, S., Winsberg, S., Donnadieu, S., De Soete, G., and Krimphoff, J.
(1995). “Perceptual scaling of synthesized musical timbres: Common
dimensions, specificities, and latent subject classes,” Psychol. Res. 58(3),
177–192.
Merer, A., Ystad, S., Kronland-Martinet, R., and Aramaki, M. (2007).
“Semiotics of sounds evoking motions: Categorization and acoustic
features,” in International Symposium on Computer Music Modeling andRetrieval (Springer, Berlin Heidelberg), pp. 139–158.
Olivero, A., Torr�esani, B., and Kronland-Martinet, R. (2010). “A new
method for Gabor multipliers estimation: Application to sound morphing,”
in European Signal Processing Conference (EUSIPCO 2010), Aalborg,
Denmark, pp. 507–511.
Peeters, G. (2004). “A large set of audio features for sound description (sim-
ilarity and classification) in the CUIDADO project,” Technical Report,
IRCAM.
Poli, G. D., and Prandoni, P. (1997). “Sonological models for timbre charac-
terization,” J. New Music Res. 26(2), 170–197.
Pollard, H. F., and Jansson, E. V. (1982). “A tristimulus method for the
specification of musical timbre,” Acta Acust. Acust. 51(3), 162–171.
Ramsay, J. O. (2006). Functional Data Analysis (John Wiley & Sons, Inc.,
Hoboken, NJ).
Roz�e, J. (2017). “Exploring the perceived harshness of cello sounds by
morphing and synthesis techniques,” available at http://www.lma.cnrs-
mrs.fr/~kronland/Cello_Harshness (Last viewed January 3, 2017).
Roz�e, J., Aramaki, M., Kronland-Martinet, R., Voinier, T., Bourdin, C.,
Chadefaux, D., Dufrenne, M., and Ystad, S. (2016). “Assessing the influ-
ence of constraints on cellists’ postural displacements and musical
expressivity,” in Music, Mind and Embodiment—Post-Proceedings ofCMMR (Springer, New York), Vol. 9617 of LNCS, pp. 22–41.
Schaeffer, P. (1966). Trait�e des Objets Musicaux (Treaty of MusicalObjects) (Editions du Seuil, Paris).
Schelleng, J. C. (1973). “The bowed string and the player,” J. Acoust. Soc.
Am. 53(1), 26–41.
Schoonderwaldt, E. (2009). “The violinist’s sound palette: Spectral centroid,
pitch flattening and anomalous low frequencies,” Acta Acust. Acust.
95(5), 901–914.
Schoonderwaldt, E., Guettler, K., and Askenfelt, A. (2003). “Effect of the
width of the bow hair on the violin string spectrum,” in Proceedings of theStockholm Music Acoustics Conference (SMAC).
Sciabica, J.-F., Olivero, A., Roussarie, V., Ystad, S., and Kronland-
Martinet, R. (2012). “Dissimilarity test modelling by time-frequency rep-
resentation applied to engine sound,” in Audio Engineering SocietyConference: 45th International Conference: Applications of Time-Frequency Processing in Audio.
Søndergaard, P. L., Torr�esani, B., and Balazs, P. (2012). “The linear time
frequency analysis toolbox,” Int. J. Wavelets, Multiresol. Inform. Process.
10(04), 1250032.
Stepanek, J. (2006). “Musical sound timbre: Verbal description and
dimensions,” in Proceedings of the 9th International Conference onDigital Audio Effects (DAFx-06), pp. 121–126.
Stepanek, J., and Otcenasek, Z. (2005). “Acoustical correlates of the main
features of violin timbre perception,” in Proceedings of the Conference onInterdisciplinary Musicology, pp. 1–9.
Thompson, M. R., and Luck, G. (2012). “Exploring relationships between
pianists’ body movements, their expressive intentions, and structural ele-
ments of the music,” Musicae Sci. 16(1), 19–40.
Van Zijl, A. G., and Luck, G. (2013). “Moved through music: The effect of
experienced emotions on performers’ movement characteristics,” Psychol.
Music 41(2), 175–197.
Wanderley, M. M., Vines, B. W., Middleton, N., McKay, C., and Hatch, W.
(2005). “The musical significance of clarinetists’ ancillary gestures: An
exploration of the field,” J. New Music Res. 34(1), 97–113.
Z€olzer, U. (2008). Digital Audio Signal Processing (John Wiley & Sons,
Hoboken, NJ).
2136 J. Acoust. Soc. Am. 141 (3), March 2017 Roz�e et al.
Top Related