Time-Frequency Analysis of the Shepard Tone
A Senior Project submitted to The Division of Science, Mathematics, and Computing
of Bard College
by Shun-Yang Lee
Annandale-on-Hudson, New York
April, 2010
Abstract
This project focuses on the time-frequency analysis of the Shepard Tone. The main techniques used in this project are Fourier Transform and other related mathematical tools. From the analysis we are able to state the general properties of the Shepard Tone, which enable us to reconstruct our own Shepard Tones. We conclude that the choice of envelope functions, the spacing between sound threads, and the sound threads being parallel or not decide the degree of illusiveness of a Shepard Tone.
Contents
Abstract
Dedication
Acknowledgments
1 Introduction
2 Background Research
  2.1 Orthogonal Systems of Functions
  2.2 Fourier Series
    2.2.1 Dirichlet’s Conditions
  2.3 Fourier Transform
    2.3.1 Linearity Property
    2.3.2 Time-differentiation Property
    2.3.3 Time-shift Property
    2.3.4 Frequency-shift Property
    2.3.5 Fourier Transform Pair
    2.3.6 Symmetry Property
  2.4 Dirichlet’s Conditions for the Fourier Integral
  2.5 Convolution
    2.5.1 Convolution in time
    2.5.2 Convolution in frequency
  2.6 Gibbs Phenomenon
  2.7 Discrete Fourier Transform
  2.8 Aliasing
  2.9 Windowed Fourier Transform
  2.10 Physics of Sound
3 Preliminary Investigation of the Shepard Tone
  3.1 Using SoundRuler
  3.2 Using Tone Analyzer
    3.2.1 About Tone Analyzer
  3.3 Observations
  3.4 Some Explanations
4 Reconstructing the Shepard Tone
  4.1 Method
  4.2 Choice of Envelope
  4.3 Setup 1: Parallel Threads Equally Spaced in Time
    4.3.1 Octaves
    4.3.2 Major Sevenths (M7)
    4.3.3 Minor Sevenths (m7)
    4.3.4 Major Sixths (M6)
    4.3.5 Minor Sixths (m6)
    4.3.6 Perfect Fifths (P5)
    4.3.7 Augmented Fourths (A4)
    4.3.8 Perfect Fourths (P4)
    4.3.9 Major Thirds (M3)
    4.3.10 Minor Thirds (m3)
    4.3.11 Major Seconds (M2)
    4.3.12 Minor Seconds (m2)
    4.3.13 Observations
  4.4 Setup 2: Parallel VS Non-Parallel Threads
    4.4.1 Case 1: Parallel Sound Threads
    4.4.2 Case 2: Non-Parallel Sound Threads A
    4.4.3 Case 3: Non-Parallel Sound Threads B
    4.4.4 Observations
5 Conclusion
6 Future Work
A Tone Analyzer
B Shepard Tone Generator
Bibliography
List of Figures
2.6.1 Gibbs phenomenon (from http://cnx.org/content/m28717/latest/hv12.jpg)
3.1.1 The 48.8’th Second
3.1.2 The 49.2’th Second
3.1.3 The 49.6’th Second
3.1.4 The 50.0’th Second
3.1.5 The 50.4’th Second
3.1.6 The 50.8’th Second
3.2.1 Frequency vs Time
3.2.2 Log(Frequency) vs Time
3.2.3 At the 41.1 second
4.2.1 Envelope = $t^{10}(t-50)^6$
4.3.1 Harmonics (from http://www.lamadeguido.com/Image4.gif)
4.3.2 Helix Representation of Pitches (from http://www1.appstate.edu/˜kms/classes/psy3203/MusicIllusions/helix.gif)
4.4.1 Case 1: Frequency vs Time
4.4.2 Case 1: 14.7 second
4.4.3 Case 1: 16.8 second
4.4.4 Case 2: Frequency vs Time
4.4.5 Case 2: 14.7 second
4.4.6 Case 2: 16.8 second
4.4.7 Case 3: Frequency vs Time
4.4.8 Case 3: 14.7 second
4.4.9 Case 3: 16.8 second
Acknowledgments
I would like to thank my advisor, Cliona Golden, for her guidance on this project and countless advice she has given me that has been invaluable both inside and outside the classroom. I would like to thank Sam Hsiao and Melvin Chen for being on my board and giving me helpful suggestions. I would also like to express my gratitude to John Halle, without whom I would have never known that the Shepard Tone actually exists. Lastly, I want to thank all my friends at Bard who make my college experience unforgettable.
1 Introduction
An audio illusion is a sound that confuses the listener’s auditory mechanisms and so leads to illusory perceptions. Some audio illusions manipulate the harmonics of certain fundamental frequencies; others make use of two or more independent audio channels to create illusory effects. Examples of audio illusions include the Glissando Illusion, the Scale Illusion, the Octave Illusion, the Tritone Paradox, and the Shepard Tone. This project focuses on the Shepard Tone and explores its properties.
The term Shepard Tone usually refers to the audio illusion in which different sound threads are superposed an octave, or several octaves, apart. In this project we consider the generalized Shepard Tone, in which the sound threads may be any interval apart.
Chapter 2 discusses the Fourier Transform and other related mathematical tools that are used in this project to perform a time-frequency analysis of the Shepard Tone.
Chapter 3 describes our investigations of the Shepard Tone. We first use the commercial software SoundRuler to perform an initial analysis of James Tenney’s Shepard Tone piece ‘For Ann’, and then use Tone Analyzer, written in MATLAB for the purpose of this project, to perform our time-frequency analysis of this piece.
Chapter 4 discusses the reconstruction of the Shepard Tone. Setup 1 focuses on Shepard Tones with parallel threads that are equally spaced in time. A distance measure is used to describe the degree of illusiveness in each of the settings in Setup 1. Setup 2 focuses on Shepard Tones with non-parallel threads. A variance-related measurement is introduced to describe the degree of illusiveness in each of the settings in Setup 2.
Chapter 5 lists the properties of the Shepard Tone that we discovered from the investigations in the previous chapters.
Finally, Chapter 6 describes some topics for future research.
2 Background Research
In the field of audio signal processing, computers are often used to analyze audio signals. We can analyze an audio signal by converting the input time-domain function (i.e. the input audio signal) into a frequency-domain function. The Fourier Transform is the standard tool for describing the frequency content of an audio signal. However, as we will see later, it is ultimately inadequate for the purposes of this project and will be replaced by the Short-time Fourier Transform (also called the Windowed Fourier Transform). Nonetheless, the Fourier Transform forms the basis of our toolbox, so we start by introducing it.
2.1 Orthogonal Systems of Functions
In this section we are going to describe the notion of orthogonality. To do this, we first
provide some preliminary definitions on which it relies.
Definition 2.1.1. Let $H$ be a subspace of a vector space $V$. An indexed set of vectors $B = \{\mathbf{b}_1, \mathbf{b}_2, \ldots, \mathbf{b}_p\}$ in $V$ is a basis for $H$ if
• $B$ is a linearly independent set, and
• the subspace spanned by $B$ coincides with $H$; that is,
$$H = \operatorname{Span}\{\mathbf{b}_1, \mathbf{b}_2, \ldots, \mathbf{b}_p\}.$$
Definition 2.1.2. An inner product of two vectors $\mathbf{a} = [a_1, a_2, \ldots, a_n]$ and $\mathbf{b} = [b_1, b_2, \ldots, b_n]$ in $\mathbb{R}^n$ is defined as
$$\langle \mathbf{a}, \mathbf{b} \rangle = a_1 b_1 + a_2 b_2 + \cdots + a_n b_n.$$
Definition 2.1.3. A set of vectors $\{\mathbf{u}_1, \mathbf{u}_2, \ldots, \mathbf{u}_p\}$ in $\mathbb{R}^n$, with $p \le n$, is said to be an orthogonal set if each pair of distinct vectors from the set is orthogonal, that is, if
$$\langle \mathbf{u}_i, \mathbf{u}_j \rangle = 0 \quad \text{whenever } i \neq j.$$
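Here we cite a theorem from [3, P.384].
Theorem 2.1.4. If $S = \{\mathbf{u}_1, \mathbf{u}_2, \ldots, \mathbf{u}_p\}$ is an orthogonal set of nonzero vectors in $\mathbb{R}^n$, then $S$ is linearly independent and hence is a basis for the subspace spanned by $S$.
Proof. If $\mathbf{0} = c_1\mathbf{u}_1 + \cdots + c_p\mathbf{u}_p$ for some scalars $c_1, c_2, \ldots, c_p$, then
$$0 = \mathbf{0} \cdot \mathbf{u}_1 = (c_1\mathbf{u}_1 + c_2\mathbf{u}_2 + \cdots + c_p\mathbf{u}_p) \cdot \mathbf{u}_1 \tag{2.1.1}$$
$$= (c_1\mathbf{u}_1) \cdot \mathbf{u}_1 + (c_2\mathbf{u}_2) \cdot \mathbf{u}_1 + \cdots + (c_p\mathbf{u}_p) \cdot \mathbf{u}_1 \tag{2.1.2}$$
$$= c_1(\mathbf{u}_1 \cdot \mathbf{u}_1) + c_2(\mathbf{u}_2 \cdot \mathbf{u}_1) + \cdots + c_p(\mathbf{u}_p \cdot \mathbf{u}_1) \tag{2.1.3}$$
$$= c_1(\mathbf{u}_1 \cdot \mathbf{u}_1), \tag{2.1.4}$$
since $\mathbf{u}_1$ is orthogonal to $\mathbf{u}_2, \mathbf{u}_3, \ldots, \mathbf{u}_p$. Since $\mathbf{u}_1$ is nonzero, $\mathbf{u}_1 \cdot \mathbf{u}_1$ is not zero and so $c_1$ must be zero. Similarly, $c_2, c_3, \ldots, c_p$ must be zero. Therefore $S$ is linearly independent and is a basis for the subspace spanned by $S$.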
The set of Riemann-integrable real-valued functions defines a vector space. We can define an inner product on this space as follows.
Definition 2.1.5. If $f$ and $g$ are Riemann-integrable, real-valued functions that are defined on $[a, b] \subset \mathbb{R}$, then the integral
$$\int_a^b f(x)g(x)\,dx$$
defines the inner product of $f$ and $g$, denoted by $\langle f, g \rangle$.
The non-negative number $\langle f, f \rangle^{1/2}$, denoted by $\|f\|$, is the norm of $f$.
From this, we get the definition of orthogonality, which can now be stated for a countably infinite set of functions.
Definition 2.1.6. Two functions $f$ and $g$ are said to be orthogonal if $\langle f, g \rangle = 0$. Similarly, let $S = \{f_1, f_2, f_3, \ldots\}$ be a collection of Riemann-integrable functions on $[a, b]$; then $S$ is said to be an orthogonal system on $[a, b]$ if
$$\langle f_m, f_n \rangle = 0 \quad \text{whenever } m \neq n.$$
Furthermore, if $f_n$ has norm 1 for each $n$, then the system is an orthonormal system on $[a, b]$.
2.2 Fourier Series
The simplest form of Fourier Transform is Fourier Series, in which we consider only time-domain functions that are periodic on $[0, 1]$ as inputs. We will proceed with a special trigonometric system $S = \{\phi_0, \phi_1, \ldots\}$ for which
$$\phi_0(x) = 1, \quad \phi_{2n-1}(x) = \sqrt{2}\cos 2\pi n x, \quad \phi_{2n}(x) = \sqrt{2}\sin 2\pi n x, \quad n = 1, 2, \ldots$$
We will prove that $S$ is an orthonormal system on the interval $[0, 1]$.
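Proof.
• Firstly we will show that $\langle \phi_{2m}, \phi_{2n} \rangle = \int_0^1 \sqrt{2}\sin(2\pi m x)\,\sqrt{2}\sin(2\pi n x)\,dx = 0$ whenever $m \neq n$.
$$\int_0^1 \sqrt{2}\sin(2\pi m x)\,\sqrt{2}\sin(2\pi n x)\,dx = 2\int_0^1 -\frac{1}{2}\left[\cos(2\pi(m+n)x) - \cos(2\pi(m-n)x)\right]dx \tag{2.2.1}$$
$$= -\left[\int_0^1 \cos(2\pi(m+n)x)\,dx - \int_0^1 \cos(2\pi(m-n)x)\,dx\right] \tag{2.2.2}$$
$$= -\left[\frac{\sin(2\pi(m+n)x)}{2\pi(m+n)} - \frac{\sin(2\pi(m-n)x)}{2\pi(m-n)}\right]_0^1 \tag{2.2.3}$$
$$= 0, \tag{2.2.4}$$
since $m + n$ and $m - n$ are integers.
• Similarly, it can be shown that
$$\langle \phi_{2m-1}, \phi_{2n-1} \rangle = \int_0^1 \sqrt{2}\cos(2\pi m x)\,\sqrt{2}\cos(2\pi n x)\,dx \tag{2.2.5}$$
$$= 0 \quad \text{whenever } m \neq n, \tag{2.2.6}$$
$$\langle \phi_{2m}, \phi_{2n-1} \rangle = \int_0^1 \sqrt{2}\sin(2\pi m x)\,\sqrt{2}\cos(2\pi n x)\,dx \tag{2.2.7}$$
$$= 0 \quad \text{for all } m \text{ and } n. \tag{2.2.8}$$
For $n = 1, 2, \ldots$, we have the following.
$$\|\phi_{2n}\|^2 = \langle \phi_{2n}, \phi_{2n} \rangle = \int_0^1 \sqrt{2}\sin(2\pi n x)\,\sqrt{2}\sin(2\pi n x)\,dx \tag{2.2.9}$$
$$= 1, \tag{2.2.10}$$
$$\|\phi_{2n-1}\|^2 = \langle \phi_{2n-1}, \phi_{2n-1} \rangle = \int_0^1 \sqrt{2}\cos(2\pi n x)\,\sqrt{2}\cos(2\pi n x)\,dx \tag{2.2.11}$$
$$= 1. \tag{2.2.12}$$
We also know that
$$\langle \phi_0, \phi_0 \rangle = \int_0^1 1^2\,dx = 1.$$
Therefore
$$\langle \phi_{2m}, \phi_{2n} \rangle = \langle \phi_{2m-1}, \phi_{2n-1} \rangle = \delta_{m,n}$$
for $n, m = 1, 2, \ldots$, and
$$\|\phi_{2m}\| = \|\phi_{2n-1}\| = 1$$
for $m = 0, 1, 2, \ldots$, $n = 1, 2, \ldots$.
• The equations above constitute the orthogonality relations for the $\phi_n$’s, and show that the set of functions
$$\phi_0(x) = 1, \quad \phi_{2n-1}(x) = \sqrt{2}\cos 2\pi n x, \quad \phi_{2n}(x) = \sqrt{2}\sin 2\pi n x, \quad n = 1, 2, \ldots$$
is an orthonormal system.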
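The orthonormality relations above can also be checked numerically. The following is a minimal sketch in Python (not part of the original project, which used MATLAB); the helper names `phi` and `inner` and the midpoint-rule integration are assumptions made for illustration.

```python
import math

def phi(k, x):
    """k-th function of the trigonometric system on [0, 1]."""
    if k == 0:
        return 1.0
    n = (k + 1) // 2  # phi_{2n-1} and phi_{2n} share the same n
    if k % 2 == 1:
        return math.sqrt(2) * math.cos(2 * math.pi * n * x)
    return math.sqrt(2) * math.sin(2 * math.pi * n * x)

def inner(j, k, steps=2000):
    """Midpoint-rule approximation of <phi_j, phi_k> on [0, 1]."""
    h = 1.0 / steps
    return h * sum(phi(j, (i + 0.5) * h) * phi(k, (i + 0.5) * h)
                   for i in range(steps))

# Gram matrix of the first five functions; it should be close to the identity.
gram = [[inner(j, k) for k in range(5)] for j in range(5)]
for row in gram:
    print(["%.6f" % abs(v) for v in row])
```

Each diagonal entry approximates a squared norm and each off-diagonal entry approximates an inner product of two distinct functions.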
Given $f$ in $C[0, 1]$, where $C[0, 1]$ denotes all the continuous functions on the interval $[0, 1]$, we can find the $n$-th order Fourier approximation to $f$ on $[0, 1]$ by calculating the orthogonal projection of $f$ onto the orthonormal basis $\{\phi_k\}$, where
$$\phi_0(x) = 1, \quad \phi_{2n-1}(x) = \sqrt{2}\cos 2\pi n x, \quad \phi_{2n}(x) = \sqrt{2}\sin 2\pi n x, \quad n = 1, 2, \ldots.$$
The Fourier coefficients of $f$ are given by
$$a_0 = \langle f, \phi_0 \rangle = \int_0^1 f(t)\,dt, \tag{2.2.13}$$
$$a_k = \langle f, \phi_{2k-1} \rangle = \int_0^1 f(t)\sqrt{2}\cos(2\pi k t)\,dt, \quad k \ge 1, \tag{2.2.14}$$
$$b_k = \langle f, \phi_{2k} \rangle = \int_0^1 f(t)\sqrt{2}\sin(2\pi k t)\,dt, \quad k \ge 1. \tag{2.2.15}$$
The Fourier Series of $f$ in $C[0, 1]$ is then defined by:
$$f(t) = a_0 + \sum_{m=1}^{\infty}\left\{a_m\sqrt{2}\cos 2\pi m t + b_m\sqrt{2}\sin 2\pi m t\right\}, \tag{2.2.16}$$
where $a_0$, $a_m$ and $b_m$, $m = 1, 2, \ldots$, are the Fourier Coefficients.
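As a worked illustration of equations (2.2.13)-(2.2.16), the sketch below computes the coefficients numerically for the test function f(t) = t(1 - t), chosen here only as an example and not taken from the project. The helper names, the midpoint-rule integration and the truncation order of 20 are all assumptions.

```python
import math

SQRT2 = math.sqrt(2)

def integrate(func, steps=2000):
    """Midpoint-rule approximation of the integral of func over [0, 1]."""
    h = 1.0 / steps
    return h * sum(func((i + 0.5) * h) for i in range(steps))

def f(t):
    # Illustrative test function, continuous on [0, 1] with f(0) = f(1).
    return t * (1 - t)

def coefficients(func, order):
    """Return (a0, [a_1..a_order], [b_1..b_order]) as in (2.2.13)-(2.2.15)."""
    a0 = integrate(func)
    a = [integrate(lambda t, k=k: func(t) * SQRT2 * math.cos(2 * math.pi * k * t))
         for k in range(1, order + 1)]
    b = [integrate(lambda t, k=k: func(t) * SQRT2 * math.sin(2 * math.pi * k * t))
         for k in range(1, order + 1)]
    return a0, a, b

def partial_sum(a0, a, b, t):
    """Truncated version of the series (2.2.16) evaluated at t."""
    total = a0
    for m, (am, bm) in enumerate(zip(a, b), start=1):
        total += am * SQRT2 * math.cos(2 * math.pi * m * t)
        total += bm * SQRT2 * math.sin(2 * math.pi * m * t)
    return total

a0, a, b = coefficients(f, order=20)
approx = partial_sum(a0, a, b, 0.3)
print(a0, a[0], b[0], approx, f(0.3))
```

For this particular function the sine coefficients vanish by symmetry about t = 1/2, and the truncated series at t = 0.3 should already be close to f(0.3) = 0.21.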
Fourier Series can also be written in exponential form, using Euler’s formula
$$e^{i\theta} = \cos(\theta) + i\sin(\theta),$$
where $i$ is $\sqrt{-1}$.
Therefore, the exponential form of the Fourier Series expansion of $f$ is
$$f(t) = \sum_{n=-\infty}^{\infty} F_n e^{i2\pi n t}, \tag{2.2.17}$$
where
$$F_n = F(n) = \int_0^1 f(t)e^{-i2\pi n t}\,dt.$$
We can extend this definition to a generalized version where the period of the functions is $T$ as follows.
$$f(t) = \sum_{n=-\infty}^{\infty} F_n e^{i\frac{2\pi}{T}nt}, \tag{2.2.18}$$
where the integral is now taken over one full period $[0, T]$:
$$F_n = F(n) = \frac{1}{T}\int_0^T f(t)e^{-i\frac{2\pi}{T}nt}\,dt.$$
2.2.1 Dirichlet’s Conditions
It turns out that functions which satisfy certain conditions will have a convergent Fourier
Series expansion. Citing from [1, P.286], the conditions known as Dirichlet’s conditions
are stated in the following theorem.
Theorem 2.2.1. Dirichlet’s Conditions
If f(t) is a bounded, periodic function that in any period has
• a finite number of isolated maxima and minima, and
• a finite number of points of finite discontinuity
then the Fourier Series expansion of f(t) converges to f(t) at all points where f(t) is
continuous and to the average of the right- and left-hand limits of f(t) at points where
f(t) is discontinuous.
2.3 Fourier Transform
For input audio signals that are not periodic, the more general Fourier Transform is often used. The Fourier Transform of a time-domain function $f(t)$ is defined as
$$(\mathcal{F}f)(\xi) = F(\xi) = \int_{-\infty}^{\infty} f(t)e^{-i2\pi\xi t}\,dt. \tag{2.3.1}$$
We can represent a time-domain input function in terms of frequency-domain functions by the following equation:
$$f(t) = \int_{-\infty}^{\infty} F(\xi)e^{i2\pi\xi t}\,d\xi. \tag{2.3.2}$$
The Fourier Transform has the following properties:
• Linearity Property
• Time-differentiation Property
• Time-shift Property
• Frequency-shift Property
• Symmetry Property
which are explained below.
2.3.1 Linearity Property
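If $f(t)$ and $g(t)$ are functions having Fourier Transforms $F(\xi)$ and $G(\xi)$, respectively, and if $\alpha$ and $\beta$ are constants, then
$$(\mathcal{F}(\alpha f + \beta g))(\xi) = \alpha(\mathcal{F}f)(\xi) + \beta(\mathcal{F}g)(\xi) = \alpha F(\xi) + \beta G(\xi).$$
Proof.
$$(\mathcal{F}(\alpha f + \beta g))(\xi) = \int_{-\infty}^{\infty}\left[\alpha f(t) + \beta g(t)\right]e^{-i2\pi\xi t}\,dt \tag{2.3.3}$$
$$= \alpha\int_{-\infty}^{\infty} f(t)e^{-i2\pi\xi t}\,dt + \beta\int_{-\infty}^{\infty} g(t)e^{-i2\pi\xi t}\,dt \tag{2.3.4}$$
$$= \alpha F(\xi) + \beta G(\xi). \tag{2.3.5}$$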
2.3.2 Time-differentiation Property
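If the function $f(t)$ has a Fourier Transform $F(\xi)$ then
$$f(t) = \int_{-\infty}^{\infty} F(\xi)e^{i2\pi\xi t}\,d\xi.$$
If we differentiate $f$ with respect to $t$ then we will get
$$\frac{df}{dt} = \int_{-\infty}^{\infty}\frac{\partial}{\partial t}\left[F(\xi)e^{i2\pi\xi t}\right]d\xi = \int_{-\infty}^{\infty}(i2\pi\xi)F(\xi)e^{i2\pi\xi t}\,d\xi,$$
which implies that $\frac{df}{dt}$ is the inverse Fourier Transform of $(i2\pi\xi)F(\xi)$. This means
$$\mathcal{F}\left\{\frac{df}{dt}\right\} = (i2\pi\xi)F(\xi).$$
By repeatedly taking the derivatives we get
$$\mathcal{F}\left\{\frac{d^n f}{dt^n}\right\} = (i2\pi\xi)^n F(\xi).$$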
2.3.3 Time-shift Property
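If a function $f(t)$ has Fourier Transform $F(\xi)$ and we set $g(t) = f(t - \tau)$ to be a shifted version of $f(t)$, then
$$(\mathcal{F}g)(\xi) = e^{-i2\pi\xi\tau}F(\xi).$$
Proof.
$$(\mathcal{F}g)(\xi) = \int_{-\infty}^{\infty} g(t)e^{-i2\pi\xi t}\,dt = \int_{-\infty}^{\infty} f(t - \tau)e^{-i2\pi\xi t}\,dt.$$
Let $x = t - \tau$, then we get
$$(\mathcal{F}g)(\xi) = \int_{-\infty}^{\infty} f(x)e^{-i2\pi\xi(x+\tau)}\,dx = e^{-i2\pi\xi\tau}\int_{-\infty}^{\infty} f(x)e^{-i2\pi\xi x}\,dx = e^{-i2\pi\xi\tau}F(\xi).$$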
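The time-shift property can be spot-checked numerically. The sketch below is an illustration only: it approximates the Fourier integral (2.3.1) by a midpoint Riemann sum on a truncated interval and uses the Gaussian exp(-πt²), whose transform under this convention is exp(-πξ²). The function names, the shift τ = 0.75, the frequency ξ = 1.3 and the integration limits are arbitrary assumptions.

```python
import cmath
import math

def fourier_transform(func, xi, lo=-8.0, hi=8.0, steps=3200):
    """Midpoint Riemann sum for the integral of func(t) * exp(-i 2 pi xi t) dt."""
    h = (hi - lo) / steps
    total = 0j
    for i in range(steps):
        t = lo + (i + 0.5) * h
        total += func(t) * cmath.exp(-2j * math.pi * xi * t)
    return total * h

def f(t):
    # Gaussian test function; it is negligible outside [-8, 8].
    return math.exp(-math.pi * t * t)

tau, xi = 0.75, 1.3

def g(t):
    # Shifted copy of f, as in the statement of the property.
    return f(t - tau)

lhs = fourier_transform(g, xi)
rhs = cmath.exp(-2j * math.pi * xi * tau) * fourier_transform(f, xi)
print(abs(lhs - rhs))
```

The printed difference should be at the level of floating-point rounding, since the shift only multiplies the transform by a unit-modulus phase factor.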
2.3.4 Frequency-shift Property
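If the function $f(t)$ has Fourier Transform $F(\xi)$, then the Fourier Transform of $g(t) = e^{i2\pi\xi_0 t}f(t)$ is
$$(\mathcal{F}(e^{i2\pi\xi_0 t}f))(\xi) = F(\xi - \xi_0).$$
Proof.
$$(\mathcal{F}g)(\xi) = \int_{-\infty}^{\infty} e^{i2\pi\xi_0 t}f(t)e^{-i2\pi\xi t}\,dt = \int_{-\infty}^{\infty} f(t)e^{-i2\pi(\xi - \xi_0)t}\,dt = F(\xi - \xi_0).$$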
2.3.5 Fourier Transform Pair
Recall that the Fourier Transform of a function $f(t)$ is defined by
$$(\mathcal{F}f)(\xi) = F(\xi) = \int_{-\infty}^{\infty} f(t)e^{-i2\pi\xi t}\,dt,$$
whenever the integral exists. Now we define the inverse Fourier Transform of $G(\xi)$ as
$$(\mathcal{F}^{-1}G)(t) = g(t) = \int_{-\infty}^{\infty} G(\xi)e^{i2\pi\xi t}\,d\xi.$$
$\{f, \mathcal{F}f\}$ is called a Fourier Transform pair.
2.3.6 Symmetry Property
We state without proof that the inverse Fourier Transform is indeed the inverse of the Fourier Transform:
$$(\mathcal{F}^{-1}F)(t) = f(t) = \int_{-\infty}^{\infty} F(\xi)e^{i2\pi\xi t}\,d\xi.$$
Replacing the dummy variable $\xi$ with $y$ we get
$$f(t) = \int_{-\infty}^{\infty} F(y)e^{i2\pi y t}\,dy.$$
Therefore,
$$f(-t) = \int_{-\infty}^{\infty} F(y)e^{-i2\pi y t}\,dy.$$
Now replacing $t$ by $\xi$ we get
$$f(-\xi) = \int_{-\infty}^{\infty} F(y)e^{-i2\pi\xi y}\,dy.$$
Notice that the right hand side of the above equation is the Fourier Transform of $F(y)$. This means, given that
$$(\mathcal{F}f)(\xi) = F(\xi),$$
we can conclude
$$(\mathcal{F}^2 f)(\xi) = f(-\xi).$$
This can also be written as $(\mathcal{F}^{-1}f)(\xi) = (\mathcal{F}f)(-\xi)$, and is called the symmetry property.
2.4 Dirichlet’s Conditions for the Fourier Integral
Citing from [1, P.347], a set of conditions for the validity of Fourier integral representations
are stated in the following theorem.
Theorem 2.4.1. Dirichlet’s Conditions for the Fourier integral
If the function $f(t)$ is such that
• it is absolutely integrable, so that
$$\int_{-\infty}^{\infty}|f(t)|\,dt < \infty,$$
and,
• it has at most a finite number of maxima and minima and a finite number of discontinuities in any finite interval,
then the Fourier integral representation of $f(t)$ converges to $f(t)$ at all points where $f(t)$ is continuous and to the average of the right- and left-hand limits of $f(t)$ where $f(t)$ is discontinuous.
2.5 Convolution
The concept of convolution is defined as follows.
Definition 2.5.1. The convolution of two functions $f(t)$ and $g(t)$, denoted by $f * g$, is defined by
$$(f * g)(t) = \int_{-\infty}^{\infty} f(\tau)g(t - \tau)\,d\tau \tag{2.5.1}$$
$$= \int_{-\infty}^{\infty} f(t - \tau)g(\tau)\,d\tau. \tag{2.5.2}$$
2.5.1 Convolution in time
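If functions $u(t)$ and $v(t)$ have Fourier Transforms $U(\xi)$ and $V(\xi)$, respectively, then the Fourier Transform of the convolution $y(t) = (u * v)(t)$ is
$$(\mathcal{F}y)(\xi) = (\mathcal{F}(u * v))(\xi) = (\mathcal{F}(v * u))(\xi) = (UV)(\xi).$$
Proof.
$$(\mathcal{F}y)(\xi) = (\mathcal{F}(u * v))(\xi) = \int_{-\infty}^{\infty}\left[\int_{-\infty}^{\infty} u(\tau)v(t - \tau)\,d\tau\right]e^{-i2\pi\xi t}\,dt \tag{2.5.3}$$
$$= \int_{-\infty}^{\infty} u(\tau)\left[\int_{-\infty}^{\infty} e^{-i2\pi\xi t}v(t - \tau)\,dt\right]d\tau. \tag{2.5.4}$$
Now replacing $t - \tau$ with $z$ we get
$$Y(\xi) = \int_{-\infty}^{\infty} u(\tau)\left[\int_{-\infty}^{\infty} e^{-i2\pi\xi(z+\tau)}v(z)\,dz\right]d\tau \tag{2.5.5}$$
$$= \int_{-\infty}^{\infty} u(\tau)e^{-i2\pi\xi\tau}\,d\tau\int_{-\infty}^{\infty} e^{-i2\pi\xi z}v(z)\,dz \tag{2.5.6}$$
$$= (UV)(\xi). \tag{2.5.7}$$
That is, a convolution in the time domain is transformed into a product in the frequency domain.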
2.5.2 Convolution in frequency
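If $(\mathcal{F}u)(\xi) = U(\xi)$ and $(\mathcal{F}v)(\xi) = V(\xi)$, then
$$(\mathcal{F}(uv))(\xi) = (U * V)(\xi).$$
Proof. Let $u(t) = \int_{-\infty}^{\infty} U(\xi)e^{i2\pi\xi t}\,d\xi$ and $v(t) = \int_{-\infty}^{\infty} V(\xi)e^{i2\pi\xi t}\,d\xi$, then
$$(\mathcal{F}^{-1}(U * V))(t) = \int_{-\infty}^{\infty} e^{i2\pi\xi t}\left[\int_{-\infty}^{\infty} U(y)V(\xi - y)\,dy\right]d\xi \tag{2.5.8}$$
$$= \int_{-\infty}^{\infty} U(y)\left[\int_{-\infty}^{\infty} V(\xi - y)e^{i2\pi\xi t}\,d\xi\right]dy. \tag{2.5.9}$$
Now replacing $\xi - y$ by $z$ we get
$$(\mathcal{F}^{-1}(U * V))(t) = \int_{-\infty}^{\infty} U(y)\left[\int_{-\infty}^{\infty} V(z)e^{i2\pi(z+y)t}\,dz\right]dy \tag{2.5.10}$$
$$= \int_{-\infty}^{\infty} U(y)e^{i2\pi y t}\,dy\int_{-\infty}^{\infty} V(z)e^{i2\pi z t}\,dz \tag{2.5.11}$$
$$= (uv)(t). \tag{2.5.12}$$
Therefore,
$$(\mathcal{F}(uv))(\xi) = (U * V)(\xi).$$
This means that a multiplication in the time domain is transformed into a convolution in the frequency domain.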
2.6 Gibbs Phenomenon
Figure 2.6.1 (from http://cnx.org/content/m28717/latest/hv12.jpg) shows one deficiency of the Fourier Transform, the Gibbs Phenomenon: the overshoot and undershoot that occur when transforming a time-domain function with jump discontinuities or other sudden changes. Capturing a sudden change requires many high-frequency oscillations, which create undesirable artifacts.
Figure 2.6.1. Gibbs phenomenon (from http://cnx.org/content/m28717/latest/hv12.jpg)
2.7 Discrete Fourier Transform
We know that computers deal with discrete data and that, when a recording is made, it is
sampled discretely in time (for CD format the sampling rate is 44100 samples per second).
We have to use the discrete form of the Fourier Transform to meet our requirements. Citing from [1, P.389], suppose we have a sequence $\{g_k\}$ of $N$ samples drawn from a continuous-time signal $g(t)$ at equal intervals $T$. Then the discrete Fourier Transform (DFT) pair can be written as
$$G_k = \sum_{n=0}^{N-1} g_n e^{-\frac{2\pi i}{N}kn},$$
$$g_n = \frac{1}{N}\sum_{k=0}^{N-1} G_k e^{\frac{2\pi i}{N}kn},$$
where $\{G_k\}$ roughly samples the continuous Fourier Transform at equal intervals $\frac{1}{NT}$.
Comparing the discrete Fourier Transform with the continuous Fourier Transform (equation 2.3.1), we see a high degree of similarity. The main difference is that the discrete Fourier Transform replaces the integral of the continuous form with a finite sum, which enables the computer to process sampled audio signals.
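The DFT pair above translates directly into code. The following is a naive sketch written for clarity rather than speed (a practical analysis would use a fast Fourier transform); the function names and the 5 Hz test signal are assumptions chosen for illustration.

```python
import cmath
import math

def dft(g):
    """Forward DFT: G_k = sum_n g_n * exp(-2 pi i k n / N)."""
    N = len(g)
    return [sum(g[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
            for k in range(N)]

def idft(G):
    """Inverse DFT: g_n = (1/N) * sum_k G_k * exp(2 pi i k n / N)."""
    N = len(G)
    return [sum(G[k] * cmath.exp(2j * math.pi * k * n / N) for k in range(N)) / N
            for n in range(N)]

# Sample a 5 Hz sine for one second: N samples at equal intervals T.
N, T = 64, 1.0 / 64
samples = [math.sin(2 * math.pi * 5 * n * T) for n in range(N)]

spectrum = dft(samples)
# Bin k corresponds to frequency k / (N * T); search the first half only.
peak_bin = max(range(N // 2), key=lambda k: abs(spectrum[k]))
peak_hz = peak_bin / (N * T)
recovered = idft(spectrum)
roundtrip_error = max(abs(r - s) for r, s in zip(recovered, samples))
print(peak_hz, roundtrip_error)
```

The largest bin lands at the frequency of the test sine, and applying the inverse formula to the spectrum recovers the samples up to rounding error.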
2.8 Aliasing
Sometimes different signals become indistinguishable after being sampled, due to the choice of sampling frequency. This means that, when we try to recover the original signals from the sampled versions, the resulting continuous signals differ from the originals. For example, if we undersample an audio signal, then the recovered signal may include undesired lower-frequency oscillations. This is because, by undersampling, we have failed to capture all of the higher-frequency information in the signal. It turns out that aliasing can be avoided as long as the sampling frequency exceeds a quantity that we call the Nyquist frequency (more commonly known as the Nyquist rate). It is defined as twice $f_{max}$, i.e.,
$$f_{Nyq} = 2 f_{max},$$
where $f_{max}$ is the highest frequency for which the Fourier Transform is nonzero.
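A small numerical example makes the aliasing effect concrete. In the sketch below, the sampling frequency of 10 Hz and the tone frequencies of 7 Hz and 3 Hz are arbitrary values chosen for illustration. Since 2 × 7 = 14 exceeds 10, the 7 Hz sine is undersampled, and its samples coincide with those of a sign-flipped 3 Hz sine.

```python
import math

fs = 10.0                    # sampling frequency in Hz (illustrative value)
f_high, f_alias = 7.0, 3.0   # 7 Hz is undersampled because 2 * 7 > fs

# Samples of the 7 Hz sine and of a sign-flipped 3 Hz sine at the same instants.
high = [math.sin(2 * math.pi * f_high * n / fs) for n in range(50)]
low = [-math.sin(2 * math.pi * f_alias * n / fs) for n in range(50)]

# The two sample sequences agree up to floating-point rounding.
max_diff = max(abs(a - b) for a, b in zip(high, low))
print(max_diff)
```

After sampling, nothing distinguishes the 7 Hz tone from a 3 Hz tone, which is exactly the kind of undesired lower-frequency oscillation described above.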
2.9 Windowed Fourier Transform
Sometimes we are interested in the behavior of functions when localized in time, i.e., we would like to know the behavior of functions over some very short period of time. For this purpose we have the Windowed Fourier Transform, or Short-time Fourier Transform. The Windowed Fourier Transform is defined as
$$X(\tau, \xi) = \int_{-\infty}^{\infty} x(t)W(t - \tau)e^{-i2\pi\xi t}\,dt,$$
where $W(t)$ is the window function, which could be, for example, a Rectangle, Hamming, Gaussian, Hanning, or Blackman window, etc.; $x(t)$ is the input time-domain function that is to be transformed. We can interpret $X(\tau, \xi)$ as the Fourier Transform of the windowed function $W(t - \tau)x(t)$.
We can measure the energy of the function in a certain time-frequency neighborhood. One way to measure the energy density in some time-frequency neighborhood is the spectrogram, which is denoted $P_S$ and is defined as
$$(P_S x)(\tau, \xi) = |X(\tau, \xi)|^2 = \left|\int_{-\infty}^{\infty} x(t)W(t - \tau)e^{-i2\pi\xi t}\,dt\right|^2.$$
In practice, for functions that have been sampled, we apply the discrete Fourier Transform, rather than the Fourier Transform, along with a discrete window.
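The discrete version of the spectrogram can be sketched as follows. This is an illustrative implementation only, not the project's Tone Analyzer: the function name, the non-overlapping frames, the discrete window 0.5(1 - cos(2πn/M)) and the synthetic two-tone test signal are all assumptions.

```python
import cmath
import math

def spectrogram(x, M, hop):
    """Squared magnitude of a windowed DFT for each frame of M samples."""
    window = [0.5 * (1 - math.cos(2 * math.pi * n / M)) for n in range(M)]
    frames = []
    for start in range(0, len(x) - M + 1, hop):
        seg = [x[start + n] * window[n] for n in range(M)]
        row = []
        for k in range(M // 2):  # keep the non-negative frequency bins
            X = sum(seg[n] * cmath.exp(-2j * math.pi * k * n / M)
                    for n in range(M))
            row.append(abs(X) ** 2)
        frames.append(row)
    return frames

# Synthetic signal: 50 Hz for the first half second, then 120 Hz.
fs = 1000
x = [math.sin(2 * math.pi * (50 if n < 500 else 120) * n / fs)
     for n in range(1000)]

S = spectrogram(x, M=100, hop=100)
# Bin k of a 100-sample frame corresponds to frequency k * fs / 100.
peaks_hz = [max(range(len(row)), key=row.__getitem__) * fs / 100 for row in S]
print(peaks_hz)
```

Unlike a single Fourier Transform of the whole signal, the per-frame peaks show when the frequency changes as well as which frequencies are present.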
2.10 Physics of Sound
The ‘pitch’ of a soundwave is determined by its frequency. A soundwave with a higher frequency will be heard ‘higher’ in pitch than one with a lower frequency. A normal human being can perceive soundwaves with frequencies ranging from approximately 20 Hz to 15000 Hz.
Definition 2.10.1. A sound thread is a continuous sound with its frequency and amplitude varying continuously over time.
Definition 2.10.2. A pitch class is the set of all pitches that are a whole number of octaves apart. For example, the pitch class of D is the set of all D’s in different octaves.
3 Preliminary Investigation of the Shepard Tone
3.1 Using SoundRuler
Using the commercial software SoundRuler, we perform an initial analysis of James Tenney’s famous Shepard Tone piece ‘For Ann’. First we load the input .wav file into SoundRuler and partition the piece into 120 subsections of 0.2 seconds each. Then we generate figures that show the frequency-amplitude relation at 48.8, 49.2, 49.6, 50.0, 50.4, and 50.8 seconds. We display the figures below.
Figure 3.1.1 shows the frequency-amplitude relation of James Tenney’s ‘For Ann’ at 48.8 seconds. There are eight clear peaks, with the leftmost peak having the largest amplitude. Generally speaking, at this stage each peak has a larger amplitude than the peaks at higher frequencies. In Figure 3.1.2, we see that this relationship among peaks does not hold so clearly at 49.2 seconds. Nonetheless, we do notice that the rightmost peak shifted out of the observation window between 48.8 and 49.2 seconds.
Comparing Figure 3.1.2 and Figure 3.1.3, we notice that the three rightmost peaks shifted noticeably to the right, which means that their frequencies increased and listeners would hear the sound ascend. The same phenomenon can be observed by comparing Figure 3.1.3 with Figure 3.1.4.
Another thing worth noticing is the number of peaks in each figure, which Table 3.1.1 summarizes. The number of peaks in each figure remains fairly constant. This suggests that, whenever the rightmost peaks shift out of the frequency range on which we focus, peaks with lower frequencies appear to replace them, so the total number of peaks does not decrease.
3.2 Using Tone Analyzer
3.2.1 About Tone Analyzer
We use code we wrote in MATLAB, which we call Tone Analyzer, to analyze sound files.
The code of Tone Analyzer can be found in Appendix A. It firstly loads a .wav file into
MATLAB, and then calculates relevant parameters. More specifically, the parameters it
calculates are
• Length of the sound file, L.
• Number of samples in a window, M.
• Number of points to sample the Fourier Transform at, N.
• Sampling rate, fs.
• Number of frames that need to be formed in order to chop the sound file and apply window to, nframe.

Figure    Number of Peaks
3.1.1     8
3.1.2     7
3.1.3     6
3.1.4     8
3.1.5     7
3.1.6     7
Table 3.1.1. Number of Peaks in Figures 3.1.1-3.1.6

We load James Tenney’s ‘For Ann’ into Tone Analyzer. The values of the parameters
found by Tone Analyzer are listed in Table 3.2.1.
Using the parameters calculated, Tone Analyzer will construct frames, i.e., segments of
the original audio signal that have equal length, that the designed window will be applied
to. Here we use W = 0.5(1− cos(2πx)) to be our window.
Each frame captures 0.1 seconds of audio signal, and a Windowed Fourier Transform
is performed. The moduli of the output signal are calculated and we plot three figures to
show the results.
Figure 3.2.1 reveals the time-frequency relation of the input audio signal. The horizontal
axis represents the time, the vertical axis represents the frequency, and the color represents
the amplitude, with blue being lower amplitude and red being higher amplitude. One
can notice that there are about seventeen overlapping sound threads appearing during
the whole piece, and they are equally spaced in time. The color change of the threads
corresponds to the change in amplitude. We also notice that, from Figure 3.2.1, these
threads’ frequencies seem to be increasing exponentially. We next plot another figure,
Figure 3.2.2, this time with the vertical axis representing Log-frequency. It is obvious
that these ‘sound threads’ become straight lines after taking log, confirming that they are
indeed exponential functions.
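The reasoning behind the straight lines can be checked with a few numbers: if a thread's frequency is f(t) = f0 · 2^(t/T), then log2 f(t) = log2 f0 + t/T, which is linear in t. In the sketch below, the starting frequency of 110 Hz and the octave time of 10 seconds are hypothetical values, not measurements from ‘For Ann’.

```python
import math

f0 = 110.0          # starting frequency in Hz (hypothetical)
octave_time = 10.0  # seconds needed to rise one octave (hypothetical)

times = [0, 5, 10, 15, 20]
freqs = [f0 * 2 ** (t / octave_time) for t in times]

# Equal time steps give equal steps in log-frequency: a straight line.
log_freqs = [math.log2(f) for f in freqs]
steps = [b - a for a, b in zip(log_freqs, log_freqs[1:])]
print(freqs)
print(steps)
```

The frequencies themselves grow by a constant factor per time step, while their logarithms grow by a constant amount.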
Parameter   Value     Description
L           1248815   Length of the sound file
M           2206      Number of samples in a window
N           2206      Number of points to sample the Fourier Transform at
fs          22050     Sampling Rate
nframe      5640      Number of frames needed to chop the sound file
Table 3.2.1. Parameters of ‘For Ann’
The last figure, Figure 3.2.3, shows the frequency-amplitude relation at the 41.1 second.
3.3 Observations
From Figures 3.2.1 and 3.2.2 we observe that the Shepard Tone is made up of multiple sound threads that are equally spaced in time. The frequencies of the threads exhibit exponential growth. The amplitude of each thread varies over time: it starts low, grows as time passes until it reaches its maximum, and decays afterward.
3.4 Some Explanations
The manipulation of frequencies and amplitudes of the sound threads seems to be the
source of the illusion. In fact, Shimizu’s research [4] shows that, when listening to a Shepard
Tone, the frequency closest to the previous frequency that was being heard is perceived,
as long as both of the frequencies lie within the listener’s sensitive range of perception,
which is typically between 500 Hz and 5000 Hz. Once the frequency on which the listener
concentrates passes the sensitive range the listener’s attention will automatically shift back
to a different frequency which lies within the sensitive range (Shimizu et al., 2007). Also,
the research [4] claims that the attention shift will most likely span a whole number of octaves, due to the spatial closeness of these intervals (see Figure 4.3.2). (A sound that is perceived an octave lower than another sound has half the frequency of the other sound.)
Therefore, when the frequencies grow in time, the listener follows the rise of the sound until it leaves the sensitive range. Once the sensitive range of perception is passed, the listener will automatically shift his/her attention to sounds with lower frequencies.
Since the amplitude also changes over time, it becomes difficult for the listener to notice
the shifts he/she makes while listening to the Shepard Tone. This is precisely the source
of the illusion.
Figure 3.1.1. The 48.8’th Second
Figure 3.1.2. The 49.2’th Second
Figure 3.1.3. The 49.6’th Second
Figure 3.1.4. The 50.0’th Second
Figure 3.1.5. The 50.4’th Second
Figure 3.1.6. The 50.8’th Second
Figure 3.2.1. Frequency vs Time
Figure 3.2.2. Log(Frequency) vs Time
4 Reconstructing the Shepard Tone
4.1 Method
From the end of the previous chapter, we know that a Shepard Tone is made by superposing
sound threads with different but related frequencies together, each of which, at a fixed
point in time, has a different amplitude. In fact, the term Shepard Tone is sometimes
used to refer only to sound threads that are superposed in such a way that every two neighboring
sound threads are an octave apart. In this project we consider the generalized Shepard
Tone, in which neighboring sound threads need not be an octave apart and the
sound threads need not be parallel to each other.
Here we will use Shepard Tone Generator, which we wrote in MATLAB, to reconstruct
the Shepard Tone. The Shepard Tone Generator utilizes the idea of superposing
sound threads together in order to create an audio signal that is illusive. The frequencies
of the threads will be controlled by an exponential-growth function, and the amplitudes
of the threads will be controlled by an envelope function. Schematically, our Shepard Tone
Generator produces an audio signal of the form
$$\sum_i (\text{envelope function})(t)\,\sin(f_i(t)),$$
where $i$ indexes the sound threads, and we will describe our choice of envelope function as
well as our choices of exponential frequency functions $f_i(t)$ below. Notice that a normal
human being can perceive sound waves with frequencies ranging from 20 Hz to 15000 Hz.
The corresponding Nyquist rate, i.e., twice the highest frequency, is $2 \times 15000 = 30000$ Hz.
Since $44100 > 30000$, we can thus choose the sampling frequency to be FS = 44100 Hz to
avoid aliasing effects.
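The sampling argument above can be checked in a few lines of MATLAB. This is only a minimal sketch: the variable names fMax and nyquistRate are illustrative and do not appear in Shepard Tone Generator, and the 15000 Hz bound is the figure assumed in the text.

```matlab
% Sketch: compare the sampling frequency with the Nyquist rate from the text.
fMax = 15000;              % highest audible frequency assumed in the text (Hz)
FS = 44100;                % sampling frequency used by Shepard Tone Generator (Hz)
nyquistRate = 2 * fMax;    % smallest sampling frequency that avoids aliasing (Hz)
highestRepresentable = FS / 2;   % highest frequency representable at rate FS (Hz)

fprintf('Nyquist rate: %d Hz, sampling frequency: %d Hz\n', nyquistRate, FS);
fprintf('Frequencies up to %g Hz are representable at FS\n', highestRepresentable);
assert(FS > nyquistRate);  % the condition used in the text
```

Any sampling frequency above the Nyquist rate would satisfy the same condition; 44100 Hz is simply the value chosen in the text.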
4.2 Choice of Envelope
The manipulation of the amplitude of each sound thread is crucial to the overall success
of a Shepard Tone. In order to have the listener unaware of their attention shift from
one sound thread to another, we have to increase and then decrease the volume of each
sound thread gradually. We can control the amplitude of each sound thread by applying
an envelope function to it. In the following setups, we use the polynomial
$$t^{10}(t-50)^{6}$$
as our envelope function (see Figure 4.2.1).
We chose this particular polynomial because it has the following properties: It is zero
at the beginning (t = 0 seconds) and at the end (t = 50 seconds) of our audio file; it has
a faster descent than ascent due to the choice of powers of 6 and 10; and of course it is
smooth.
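The stated properties of the envelope can be inspected numerically. The following is a rough sketch on a coarse time grid (the 0.01-second step is an arbitrary choice for illustration, not the grid used by Shepard Tone Generator). Setting the derivative of the polynomial to zero places the peak at t = 500/16 = 31.25 seconds, so the ascent lasts about 31.25 seconds and the descent about 18.75 seconds.

```matlab
% Sketch: inspect the envelope t^10 (t-50)^6 on a coarse time grid.
sec = 50;                            % length of the audio file in seconds
t = 0:0.01:sec;                      % coarse grid, for illustration only
envelope = t.^10 .* (t - sec).^6;    % elementwise powers and product

[peakValue, peakIndex] = max(envelope);
tPeak = t(peakIndex);                % location of the maximum on the grid

fprintf('envelope at t = 0: %g, at t = %g: %g\n', envelope(1), sec, envelope(end));
fprintf('peak near t = %.2f s (ascent %.2f s, descent %.2f s)\n', ...
        tPeak, tPeak, sec - tPeak);
```

The raw values of the polynomial are very large, which is harmless as long as the final signal is rescaled before playback.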
4.3 Setup 1: Parallel Threads Equally Spaced in Time
Figure 4.2.1. Envelope = $t^{10}(t-50)^{6}$
Using the harmonics graph (from http://www.lamadeguido.com/Image4.gif) shown below
(Figure 4.3.1) and some simple calculations, which will be demonstrated in the following
sections, we can construct the desired musical intervals by controlling the time interval
between sound threads. In each of the following setups we use $f(t) = Ae^{Bt}$, where $A = 2$
and $B = \frac{1}{4}$, to control the frequencies of the sound threads, and we use curve = sin(f(t)) to
construct sound threads with increasing frequencies. The choices of $A$ and $B$ here are from
observations of Figure 3.2.1, which shows the time-frequency relation of James Tenney’s
‘For Ann’. We estimate the distances between every two neighboring sound threads and
the rate of change in frequency for each thread.
One way of defining intervals is by counting the number of half steps that lie between
the given two notes. For example, two notes that form an octave have twelve half steps
that lie in between.
From the study [4], we know that the Shepard Tone’s illusiveness comes from the listener’s
attention shift. Furthermore, Shimizu’s research claims that shifts of attention
mostly coincide with a multiple of octaves because of the spatial closeness of the octave
tones. Figure 4.3.2 shows the spatial closeness of musical pitches in a helix representation
(from http://www1.appstate.edu/~kms/classes/psy3203/MusicIllusions/helix.gif).
Figure 4.3.1. Harmonics (from http://www.lamadeguido.com/Image4.gif)
We can therefore derive a distance measure which measures the distance of attention shift
that one makes when the pitch on which one concentrates passes the sensitive frequency
range and, as a consequence, the process of searching for notes from the same pitch class
begins.
Suppose a Shepard Tone is composed of sound threads that are octaves apart, i.e., at any
given point in time the frequencies of these threads represent notes from the same pitch
class. Then once the sound thread on which the listener focuses passes the sensitive frequency
range, the listener only needs to shift his/her attention one thread below the original one,
since the new thread will be in the same pitch class as the original note, only an octave
below.
Figure 4.3.2. Helix Representation of Pitches (from http://www1.appstate.edu/~kms/classes/psy3203/MusicIllusions/helix.gif)
We propose a distance measurement that can be applied to all intervals except for
major seconds (M2) and minor seconds (m2). The cases of major and minor seconds will
be discussed separately after the other intervals.
Definition 4.3.1 (Distance Measure). Suppose a particular Shepard Tone is formed by
superposing sound threads in such a way that any two neighboring threads are $n$ half steps
apart, where $n \le 12$. Then the distance of the listener’s attention shift when searching for
the closest note from the same pitch class is
$$\sigma(n) = \frac{\operatorname{lcm}(n, 12)}{12}.$$
(Note that $\sigma(n)$ will take values between 1 and 11.) In other words, the new thread
from the same pitch class as the original thread is $\sigma(n)$ octaves away from the original
one. Therefore, for an ascending Shepard Tone the listener’s attention has to shift $\sigma(n)$
octaves downward in order to find a note from the same pitch class as before. △
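As a quick illustration of Definition 4.3.1, the distance measure can be tabulated for every n from 1 to 12 with MATLAB's built-in lcm function. This is a small sketch; the variable names are chosen here for illustration and are not taken from the project's code.

```matlab
% Sketch: tabulate sigma(n) = lcm(n, 12) / 12 for n = 1, ..., 12 half steps.
n = 1:12;
sigma = lcm(n, 12) / 12;    % lcm is applied elementwise to the vector n

for k = 1:numel(n)
    fprintf('n = %2d half steps: sigma = %2d\n', n(k), sigma(k));
end
```

The resulting values lie between 1 and 11, as noted in the definition.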
4.3.1 Octaves
First of all we have to figure out the required time interval between different sound threads
in order to make any two neighboring sound threads an octave apart from each other. From
Figure 4.3.1 we can see that two notes that are an octave apart will have frequency $f$ for
the low note and $2f$ for the high note. Now, suppose at time $t$ a sound thread has frequency
$f$, and at a later point of time, $t_1$, the frequency of the same sound thread changes to $2f$.
To figure out the delay in time that will make the sound threads octaves apart from each
other, we need to figure out the time difference $t_1 - t$. Therefore, we have to solve the
following system of equations.
$$f = f(t) = Ae^{Bt},$$
$$2f = f(t_1) = Ae^{Bt_1}.$$
Substituting $Ae^{Bt}$ for $f$ and plugging that into the second equation we get
$$2Ae^{Bt} = Ae^{Bt_1}.$$
Since $A \neq 0$, we get
$$2e^{Bt} = e^{Bt_1}.$$
Now applying logarithms to both sides of the equation we get
$$\log\left(2e^{Bt}\right) = \log e^{Bt_1}.$$
Since
$$\log\left(2e^{Bt}\right) = \log 2 + \log e^{Bt} = \log 2 + Bt$$
and
$$\log e^{Bt_1} = Bt_1,$$
we then have that
$$\log 2 + Bt = Bt_1.$$
Therefore
$$t_1 = \frac{\log 2}{B} + t,$$
and
$$t_1 - t = \frac{\log 2}{B}.$$
Therefore, the time interval required for the sound threads to be octaves apart is $\frac{\log 2}{B}$
seconds, or an integer multiple of $\frac{\log 2}{B}$ seconds. Plugging in $B = \frac{1}{4}$, we know that the
threads have to be 2.7726 seconds, or an integer multiple of 2.7726 seconds, apart.
Since an octave is formed by two notes that are twelve half steps apart, we can calculate
the listener’s attention shift by plugging in n = 12 to our distance formula.
$$\sigma(12) = \frac{\operatorname{lcm}(12, 12)}{12} = \frac{12}{12} = 1.$$
This calculation shows that the listener only has to shift to the thread one octave below
in order to locate a note from the same pitch class.
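The same derivation applies to any frequency ratio r: the required delay is log(r)/B. The sketch below evaluates this for the ratios listed in the comments of Shepard Tone Generator (Appendix B), with B = 1/4. The cell array of names is only for printing, and log is the natural logarithm.

```matlab
% Sketch: thread delays log(ratio)/B for the interval ratios used in this project.
B = 1/4;    % growth rate in f(t) = A*exp(B*t)
names  = {'Octave','M7','m7','M6','m6','P5','A4','P4','M3','m3','M2','m2'};
ratios = [2, 11/6, 7/4, 5/3, 8/5, 3/2, 10/7, 4/3, 5/4, 6/5, 8/7, 12/11];

delays = log(ratios) / B;    % natural logarithm; result is in seconds

for k = 1:numel(names)
    fprintf('%-6s ratio %.4f  delay %.4f s\n', names{k}, ratios(k), delays(k));
end
```

A different tuning choice (for example 15/8 instead of 11/6 for the major seventh) would only change the corresponding entry of ratios.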
4.3.2 Major Sevenths (M7)
We can calculate the time interval between threads in order to make a major seventh
interval. By Figure 4.3.1, we know that to make a major seventh interval we need the two
notes to have frequencies $f$ and $\frac{11}{6}f$. Note that $f$ and $\frac{11}{6}f$ are not the only pair that forms
a major seventh; $f$ and $\frac{15}{8}f$ would work as well. The choice of $f$ and $\frac{11}{6}f$ is related to some
musical tuning consideration which we will not go into in detail. In the following discussion
the reader should keep in mind that the given ratios to form the desired intervals might
not be the only choices.
Similar to the octave case, we get
$$t_1 - t = \frac{\log\left(\frac{11}{6}\right)}{B},$$
and plugging in $B = \frac{1}{4}$,
$$t_1 - t \approx 2.4245.$$
Therefore, the threads have to be 2.4245 seconds, or an integer multiple of 2.4245 seconds,
apart in order to make a major seventh system.
Since a major seventh is formed by two notes that are eleven half steps apart, we can
calculate the listener’s attention shift by plugging in n = 11 to our distance formula.
$$\sigma(11) = \frac{\operatorname{lcm}(12, 11)}{12} = \frac{132}{12} = 11.$$
This calculation shows that the listener has to shift to the thread eleven octaves below
in order to locate a note from the same pitch class.
4.3.3 Minor Sevenths (m7)
Similarly, the threads have to be 2.2385 seconds, or an integer multiple of 2.2385 seconds,
apart in order to make a minor seventh system.
Since a minor seventh is formed by two notes that are ten half steps apart, we can
calculate the listener’s attention shift by plugging in n = 10 to our distance formula.
$$\sigma(10) = \frac{\operatorname{lcm}(12, 10)}{12} = \frac{60}{12} = 5.$$
Therefore the listener has to shift to the thread five octaves below in order to locate a
note from the same pitch class.
4.3.4 Major Sixths (M6)
To make a major sixth system the threads have to be 2.0433 seconds, or an integer multiple
of 2.0433 seconds, apart. Since a major sixth is formed by two notes that are nine half
steps apart, we can calculate the listener’s attention shift by plugging in n = 9 to our
distance formula.
$$\sigma(9) = \frac{\operatorname{lcm}(12, 9)}{12} = \frac{36}{12} = 3.$$
Therefore the listener has to shift to the thread three octaves below in order to locate a
note from the same pitch class.
4.3.5 Minor Sixths (m6)
To make a minor sixth system the threads have to be 1.8800 seconds, or an integer multiple
of 1.8800 seconds, apart. Since a minor sixth is formed by two notes that are eight half
steps apart, we can calculate the listener’s attention shift by plugging in n = 8 to our
distance formula.
$$\sigma(8) = \frac{\operatorname{lcm}(12, 8)}{12} = \frac{24}{12} = 2.$$
Therefore the listener has to shift to the thread two octaves below in order to locate a
note from the same pitch class.
4.3.6 Perfect Fifths (P5)
To make a perfect fifth system the threads have to be 1.6219 seconds, or an integer multiple
of 1.6219 seconds, apart. Since a perfect fifth is formed by two notes that are seven half
steps apart, we can calculate the listener’s attention shift by plugging in n = 7 to our
distance formula.
$$\sigma(7) = \frac{\operatorname{lcm}(12, 7)}{12} = \frac{84}{12} = 7.$$
Therefore the listener has to shift to the thread seven octaves below in order to locate
a note from the same pitch class.
4.3.7 Augmented Fourths (A4)
To make an augmented fourth system the threads have to be 1.4267 seconds, or an integer
multiple of 1.4267 seconds, apart. Since an augmented fourth is formed by two notes that
are six half steps apart, we can calculate the listener’s attention shift by plugging in
n = 6 to our distance formula.
$$\sigma(6) = \frac{\operatorname{lcm}(12, 6)}{12} = \frac{12}{12} = 1.$$
Therefore the listener has to shift to the thread one octave below in order to locate a
note from the same pitch class.
4.3.8 Perfect Fourths (P4)
To make a perfect fourth system the threads have to be 1.1507 seconds, or an integer
multiple of 1.1507 seconds, apart. Since a perfect fourth is formed by two notes that are
five half steps apart, we can calculate the listener’s attention shift by plugging in n = 5
to our distance formula.
$$\sigma(5) = \frac{\operatorname{lcm}(12, 5)}{12} = \frac{60}{12} = 5.$$
Therefore the listener has to shift to the thread five octaves below in order to locate a
note from the same pitch class.
4.3.9 Major Thirds (M3)
To make a major third system the threads have to be 0.8926 seconds, or an integer multiple
of 0.8926 seconds, apart. Since a major third is formed by two notes that are four half
steps apart, we can calculate the listener’s attention shift by plugging in n = 4 to our
distance formula.
$$\sigma(4) = \frac{\operatorname{lcm}(12, 4)}{12} = \frac{12}{12} = 1.$$
Therefore the listener has to shift to the thread one octave below in order to locate a
note from the same pitch class.
4.3.10 Minor Thirds (m3)
To make a minor third system the threads have to be 0.7293 seconds, or an integer multiple
of 0.7293 seconds, apart. Since a minor third is formed by two notes that are three half
steps apart, we can calculate the listener’s attention shift by plugging in n = 3 to our
distance formula.
$$\sigma(3) = \frac{\operatorname{lcm}(12, 3)}{12} = \frac{12}{12} = 1.$$
Therefore the listener has to shift to the thread one octave below in order to locate a
note from the same pitch class.
4.3.11 Major Seconds (M2)
To make a major second system the threads have to be 0.5341 seconds, or an integer
multiple of 0.5341 seconds, apart. Suppose we were to use the distance formula to calculate
the listener’s attention shift. Then, since a major second is formed by two notes that are
two half steps apart, we can plug in n = 2 to our distance formula.
$$\sigma(2) = \frac{\operatorname{lcm}(12, 2)}{12} = \frac{12}{12} = 1.$$
Therefore the listener has to shift to the thread one octave below in order to locate a
note from the same pitch class.
However, when sound threads are just two half steps, i.e., one whole step, apart, the
distance between sound threads is too small for our ears to tell clearly whether the Shepard
Tone is ascending or not. Therefore, even if we get σ(2) = 1 for Shepard Tones containing
sound threads that are major seconds apart, we cannot conclude that it has the same
degree of illusiveness as other intervals with σ value of 1.
4.3.12 Minor Seconds (m2)
To make a minor second system the threads have to be 0.3480 seconds, or an integer multiple
of 0.3480 seconds, apart. Similar to the major second case discussed above, suppose
we were to use the distance formula to calculate the listener’s attention shift. Then, since
a minor second is formed by two notes that are one half step apart, we can plug in n = 1
to our distance formula.
$$\sigma(1) = \frac{\operatorname{lcm}(12, 1)}{12} = \frac{12}{12} = 1.$$
Therefore the listener has to shift to the thread one octave below in order to locate a
note from the same pitch class.
However, when sound threads are just one half step apart, the distance between sound
threads is too small for our ears to tell clearly whether the Shepard Tone is ascending or
not. Therefore, even if we get σ(1) = 1 for Shepard Tones containing sound threads that
are minor seconds apart, we cannot conclude that it has the same degree of illusiveness as
other intervals with σ value of 1.
The relationship between intervals and the number of half steps is summarized in Table 4.3.1.
4.3.13 Observations
As Table 4.3.2 shows, by comparing the σ(n)’s of different intervals we can form a hierarchy
in terms of attention-shifting distance. We say that intervals that require shorter travel
distance tend to be more illusive because the listener is less likely to be aware of the
attention shift.
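The hierarchy described above can be reproduced by sorting the intervals by their σ values. The following sketch uses the distance formula from Definition 4.3.1; the interval abbreviations are listed in order of increasing number of half steps, and the variable names are illustrative.

```matlab
% Sketch: order the intervals by attention-shifting distance sigma(n).
names = {'m2','M2','m3','M3','P4','A4','P5','m6','M6','m7','M7','Octave'};  % n = 1..12
sigma = lcm(1:12, 12) / 12;

[sortedSigma, order] = sort(sigma);    % ascending: shortest shift first

for k = 1:numel(order)
    fprintf('sigma = %2d  %s\n', sortedSigma(k), names{order(k)});
end
```

Note that the formula alone places the major and minor seconds in the σ = 1 group; as discussed above, these two intervals are treated separately because the threads are too close for the ear to follow.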
4.4 Setup 2: Parallel VS Non-Parallel Threads
From study [4], we know that the illusiveness of a Shepard Tone comes from the inadvertent
attention shift from one sound thread to another that the listener makes when
concentrating on the Shepard Tone. A Shepard Tone is illusive precisely when the listener
is indeed unaware of this attention shift. In the following discussion, we will use the term
‘frequency distribution’ of the sound threads to refer to the pattern, or more specifically
the relative spacings, of frequency peaks that are present locally in time. These frequency
peaks correspond to the frequencies of the sound threads. (See, for example, Figure 4.4.6.)
We claim that if the sound threads’ frequency distribution hardly varies as time passes,
the listener would be unlikely to notice the attention shift. Alternatively, if there is a large
change in the frequency distribution over time, which the listener would perceive as a
change in the rate of ascent at the attention shift, then the listener would be alerted to
this shift of attention, and consequently would not find such a Shepard Tone illusive. In
order to measure this frequency distribution variation over time, we use some statistical
tools to create a measurement of variation. We will then use this measurement to describe
the degrees of illusiveness of different Shepard Tones. In this section we will compare the
degree of illusiveness between Shepard Tones with parallel sound threads and those with
non-parallel sound threads.
For each of the following 21-second-long setups, we randomly pick two points in time
(here we pick the 14.7th and the 16.8th seconds), identify the frequencies of the peaks,
and calculate the ratios of the frequencies of every two neighboring peaks. Then we will
calculate the variance of the logarithm of the ratios. Finally, we will compare the percentage
change in variance between these two points in time. Note that the smaller the value
of percentage change in variance is, the more constant the frequency spacing between
peaks is. We claim that one setup is less illusive than another if the percentage change
in variance between the two arbitrarily chosen points in time is larger than that for the
other setup. This is because if a setup has a larger percentage change of variance, then
the listener is more likely to notice the change in frequency distribution, which is central
to the illusiveness of the Shepard Tone, and therefore this setup would be less illusive. We
outline the way we calculate the relative change in variance in the following definition.
Definition 4.4.1 (Calculating the Relative Change in Variance). Suppose a particular
Shepard Tone is formed by superposing $n$ sound threads, and, at a particular point of
time $t_0$, the frequencies of these sound threads reach $f_1, f_2, f_3, \cdots, f_n$, where
$f_1 \le f_2 \le f_3 \le \cdots \le f_n$. Then we define $\sigma^2(t_0)$ in the following way.
$$\sigma^2(t_0) = \operatorname{var}\left(\log\frac{f_2}{f_1}, \log\frac{f_3}{f_2}, \cdots, \log\frac{f_n}{f_{n-1}}\right)$$
We can therefore calculate the relative change in variance between two different
points in time $t_0, t_1$, with $t_0 \le t_1$, by
$$\text{relative change in variance} = \frac{\sigma^2(t_1) - \sigma^2(t_0)}{\sigma^2(t_0)}.$$
△
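Definition 4.4.1 translates directly into a few lines of MATLAB. The sketch below is illustrative only: sigma2 is a helper name introduced here and is not part of Tone Analyzer, and the peak frequencies are the four peaks read off for Case 3 in Section 4.4.3. MATLAB's var computes the sample variance (normalized by the number of ratios minus one), which is assumed here to be the variance intended in the definition.

```matlab
% Sketch of Definition 4.4.1 with the Case 3 peak frequencies (Section 4.4.3).
% sigma2 is an illustrative helper, not part of Tone Analyzer.
sigma2 = @(f) var(log(f(2:end) ./ f(1:end-1)));   % variance of log neighbor ratios

peaksT0 = [60, 250, 2810, 4530];    % peaks (Hz) at t0 = 14.7 s, sorted ascending
peaksT1 = [80, 370, 4590, 7560];    % peaks (Hz) at t1 = 16.8 s, sorted ascending

v0 = sigma2(peaksT0);
v1 = sigma2(peaksT1);
relChange = (v1 - v0) / v0;         % relative change in variance

fprintf('sigma^2(t0) = %.4f, sigma^2(t1) = %.4f, relative change = %.4f\n', ...
        v0, v1, relChange);
```

With these inputs the printed values should be close to those reported for Case 3. The frequency vectors must be sorted in ascending order so that each ratio compares neighboring peaks.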
Here we will consider three cases. Case 1 consists of parallel sound threads, while Cases
2 and 3 consist of non-parallel sound threads with different parameters.
4.4.1 Case 1: Parallel Sound Threads
In the first case we use Shepard Tone Generator to generate a Shepard Tone of six parallel
sound threads, each of which has parameters A = 2, B = 0.25 and delay = 2.77. We use
the MATLAB command wavwrite to make this Shepard Tone into a .wav file. Then we use
Tone Analyzer to generate Figure 4.4.1, Figure 4.4.2 and Figure 4.4.3. Figure 4.4.1 shows
the time-frequency distribution of this case. Also, by reading off the data from Figure 4.4.2
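and 4.4.3 we can calculate the $\sigma^2$ values at the 14.7th second and the 16.8th second.
$$\sigma^2(14.7) = \operatorname{var}\left(\log\frac{140}{70}, \log\frac{270}{140}, \cdots, \log\frac{4390}{2190}\right) \tag{4.4.1}$$
$$= \operatorname{var}(0.6931, 0.6568, 0.7115, 0.6931, 0.6886, 0.6954) \tag{4.4.2}$$
$$= 0.3233 \times 10^{-3}, \tag{4.4.3}$$
$$\sigma^2(16.8) = \operatorname{var}\left(\log\frac{120}{60}, \log\frac{230}{120}, \cdots, \log\frac{3700}{1850}\right) \tag{4.4.4}$$
$$= \operatorname{var}(0.6931, 0.6506, 0.6931, 0.6931, 0.6986, 0.6931) \tag{4.4.5}$$
$$= 0.3222 \times 10^{-3}, \tag{4.4.6}$$
$$\frac{\sigma^2(16.8) - \sigma^2(14.7)}{\sigma^2(14.7)} = \frac{0.3222 \times 10^{-3} - 0.3233 \times 10^{-3}}{0.3233 \times 10^{-3}} = -0.0034.$$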
Figure 4.4.1. Case 1: Frequency vs Time
4.4.2 Case 2: Non-Parallel Sound Threads A
The second case consists of six non-parallel sound threads that, while all having parameters
A = 2 and delay = 2.77, have different B values that are 0.25, 0.15, 0.21, 0.19, 0.17, 0.23,
respectively. Notice the variance of the B values is 0.0014. Similar to Case 1, we use Shepard
Tone Generator and Tone Analyzer to generate relevant graphs. Figure 4.4.4 shows the
time-frequency distribution of this case. By reading off the data from Figure 4.4.5 and
4.4.6, we can calculate the $\sigma^2$ values at the 14.7th second and the 16.8th second.
$$\sigma^2(14.7) = \operatorname{var}\left(\log\frac{90}{30}, \log\frac{610}{90}, \cdots, \log\frac{4150}{1600}\right) \tag{4.4.7}$$
$$= \operatorname{var}(1.0986, 1.9136, 0.9643, 0.9531) \tag{4.4.8}$$
$$= 0.2106, \tag{4.4.9}$$
$$\sigma^2(16.8) = \operatorname{var}\left(\log\frac{130}{40}, \log\frac{960}{130}, \cdots, \log\frac{7120}{2630}\right) \tag{4.4.10}$$
$$= \operatorname{var}(1.1787, 1.9994, 1.0078, 0.9959) \tag{4.4.11}$$
$$= 0.2272, \tag{4.4.12}$$
$$\frac{\sigma^2(16.8) - \sigma^2(14.7)}{\sigma^2(14.7)} = \frac{0.2272 - 0.2106}{0.2106} = 0.0787.$$
4.4.3 Case 3: Non-Parallel Sound Threads B
The third case consists of six non-parallel sound threads that, while all having parameters
A = 2 and delay = 2.77, have different B values that are 0.25, 0.08, 0.19, 0.10, 0.24, 0.16,
respectively. Notice the variance of the B values is 0.0050, which is larger than that of Case
2. This means the B values in Case 3 vary more than those in Case 2. Figure 4.4.7 shows
the time-frequency distribution of this case. By reading off the data from Figure 4.4.8 and
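4.4.9 we can calculate the $\sigma^2$ values at the 14.7th second and the 16.8th second.
$$\sigma^2(14.7) = \operatorname{var}\left(\log\frac{250}{60}, \log\frac{2810}{250}, \log\frac{4530}{2810}\right) \tag{4.4.13}$$
$$= \operatorname{var}(1.4271, 2.4195, 0.4775) \tag{4.4.14}$$
$$= 0.9429, \tag{4.4.15}$$
$$\sigma^2(16.8) = \operatorname{var}\left(\log\frac{370}{80}, \log\frac{4590}{370}, \log\frac{7560}{4590}\right) \tag{4.4.16}$$
$$= \operatorname{var}(1.5315, 2.5181, 0.4990) \tag{4.4.17}$$
$$= 1.0194, \tag{4.4.18}$$
$$\frac{\sigma^2(16.8) - \sigma^2(14.7)}{\sigma^2(14.7)} = \frac{1.0194 - 0.9429}{0.9429} = 0.0811.$$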
4.4.4 Observations
Comparing the relative change in variance across the three cases, we find −0.0034 for
Case 1, where all sound threads are parallel, 0.0787 for Case 2, and 0.0811 for Case 3.
The small magnitude of the relative change in variance in Case 1 indicates that the spacing
between sound threads is fairly constant. Since the frequency spacing between sound
threads hardly changes, it is difficult for the listener to notice any attention shift when
listening to the sound file. In contrast, since Case 2 and Case 3 have a comparatively
large relative change in variance, the listener is more likely to notice the attention shift
when focusing on those two Shepard Tones, and thus to find both Case 2 and Case 3 less
illusive. Furthermore, since the relative change in variance in Case 3 is larger than that
in Case 2, Case 3 is even less illusive than Case 2. This conclusion is consistent with the
variances of the B values in the two cases. (The B values in Case 3 have a larger variance
than those in Case 2.)
Therefore we conclude that Shepard Tones with parallel sound threads are more illusive
than those with non-parallel sound threads, as would be expected. We can further conclude
that Shepard Tones whose B values have larger variances are less illusive than those whose
B values have smaller variances.
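The variances of the B values quoted for Cases 2 and 3 can be recomputed directly. This short sketch assumes the sample variance (MATLAB's default for var); the variable names are illustrative.

```matlab
% Sketch: sample variance of the growth rates B used in Cases 2 and 3.
Bcase2 = [0.25, 0.15, 0.21, 0.19, 0.17, 0.23];
Bcase3 = [0.25, 0.08, 0.19, 0.10, 0.24, 0.16];

varB2 = var(Bcase2);    % var normalizes by N-1 by default
varB3 = var(Bcase3);

fprintf('variance of B, Case 2: %.4f\n', varB2);
fprintf('variance of B, Case 3: %.4f\n', varB3);
```

The ordering of the two variances matches the ordering of the relative changes in variance reported for the two cases.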
Interval             Number of Half Steps
Octave               12
Major Seventh        11
Minor Seventh        10
Major Sixth          9
Minor Sixth          8
Perfect Fifth        7
Augmented Fourth     6
Perfect Fourth       5
Major Third          4
Minor Third          3
Major Second         2
Minor Second         1
Table 4.3.1. Intervals and their corresponding number of half steps
σ      Intervals
1      Octave, A4, M3, m3, (M2), (m2)
2      m6
3      M6
5      m7, P4
7      P5
11     M7
Table 4.3.2. Intervals and corresponding σ values
Interval             Seconds Apart
Octave               2.7726
Major Seventh        2.4245
Minor Seventh        2.2385
Major Sixth          2.0433
Minor Sixth          1.8800
Perfect Fifth        1.6219
Augmented Fourth     1.4267
Perfect Fourth       1.1507
Major Third          0.8926
Minor Third          0.7293
Major Second         0.5341
Minor Second         0.3480
Table 4.3.3. Intervals and their corresponding spacing in time when B = 1/4
Figure 4.4.2. Case 1: 14.7 second
Figure 4.4.3. Case 1: 16.8 second
Figure 4.4.5. Case 2: 14.7 second
Figure 4.4.6. Case 2: 16.8 second
Figure 4.4.8. Case 3: 14.7 second
Figure 4.4.9. Case 3: 16.8 second
5 Conclusion
In this project we investigated some properties of the Shepard Tone via a time-frequency
analysis.
We learned that a Shepard Tone is illusive because it makes the listener unaware of
the attention shift that is made while listening to it. Whenever the frequency of the
sound thread on which one concentrates passes the sensitive range, typically between 500
Hz and 5000 Hz, the listener’s attention will automatically shift back to another sound
thread that lies within the sensitive range. A well designed Shepard Tone takes advantage
of these attention shifts made by the listener in such a way that the listener is unaware of
these shifts. This is why the Shepard Tone is mistakenly heard as infinitely ascending.
We wrote Tone Analyzer in MATLAB to perform a time-frequency analysis on the
Shepard Tone. Next, using information from this analysis, we wrote another piece of
MATLAB code, which we called Shepard Tone Generator, to construct our own Shepard
Tones. In this way, we were able to experiment with different settings, and found through
this that the degree of illusiveness of a Shepard Tone is decided by the following factors.
• Choice of envelope function: The envelope function sets the amplitude of each sound
thread over time. A good envelope function should be smooth and have a fast ascent
followed by a fast descent. In our investigation we chose the polynomial function
$t^{10}(t-50)^{6}$ to be our envelope function.
• Spacing between sound threads: The spacing between sound threads partially
determines the degree of illusiveness of a Shepard Tone. We designed a distance
measurement σ which, along with our own perception, allowed us to conclude that the
spacings between sound threads that resulted in the most illusive Shepard Tones are
intervals with σ value of 1, which are octaves, augmented fourths, major thirds and
minor thirds. Intervals with larger σ values lead to less illusive Shepard Tones.
• Parallel or non-parallel sound threads: Based on our claim that the larger the
variation of frequency distribution of a Shepard Tone is, the less illusive this Shepard
Tone would be, we defined a measurement of variation which calculates the relative
change in variance of a Shepard Tone, and concluded from this, and our own auditory
perception, that parallel sound threads create a higher degree of illusiveness than
non-parallel sound threads. We further concluded that non-parallel sound threads
with a larger relative change in variance appear to be less illusive than those with a
smaller relative change in variance.
6 Future Work
Since the envelope function of the Shepard Tone partially determines its degree of illusiveness,
it is worth investigating the ideal type of envelope function. In this project we focused
on polynomial functions since they are smooth and can be easily chosen to have a fast
ascent followed by a fast descent. It would be worthwhile to further explore if variations
on our choice of polynomial function could improve the degree of illusiveness, or perhaps
to search for other kinds of functions that could serve as better envelope functions than
polynomial functions.
It would also be worthwhile to investigate whether the choice of the sound threads’
parameters, A and B, could potentially affect the degree of illusiveness of a Shepard Tone.
As mentioned in the introduction, there are many other audio illusions that are being
investigated, mainly by scholars from the field of psychology. It would be interesting to
extend our time-frequency analysis of the Shepard Tone to those illusions and explore
properties they possess.
Appendix A Tone Analyzer

% Assumes x is a column vector of audio samples and fs is its sampling frequency.
L=length(x);
M=0.1*fs; % raw window length = the number of samples in 0.1 sec
          % (captures frequencies down to 10 Hz)
M=2*round(M/2); % good to have even number of samples in window
N=M; % how many points to sample the Fourier transform at
     % (also good to have this even)
H=round(M/10); % H=hopsize is such that 10 consecutive windows overlap
nframes=floor((L-M)/H); % the number of frames that fit

T=(0.5*M+H*[0:1:nframes-1])/fs; % timegrid formed by midpoint of each frame
F=fs*[0:N/2-1]/N; % frequency grid for calculated Fourier transform samples

W=0.5*(1-cos(2*pi*[0:M-1]/M))'; % raised cosine window (column vector)

output=zeros(N,nframes);

xoff=0;
for m=1:nframes
    xt=x(xoff+1:xoff+M); % extract data from x with length M, which equals window length.
    xtw=W.*xt; % apply window function W to xt.
    output(:,m)=fft(xtw,N); % perform fft at N points
    xoff=xoff+H;
end

y=abs(output);
y=y(1:N/2,:); % extract only positive frequencies
Appendix B Shepard Tone Generator

FS=44100; % sampling frequency
sec=50; % length of signal in seconds.
t=[0:1/FS:sec]; % number of entries = FS*sec+1 (the +1 accounts for time 0)
numOfEntries = FS*sec+1;

A=2;
B=1/4;

curve=zeros(1,numOfEntries); % allocate the curve vector
out=zeros(1,numOfEntries);

% if B=1/4 then
% octave=log(2)/B = 2.7726
% M7=log(11/6)/B = 2.4245
% m7=log(7/4)/B = 2.2385
% M6=log(5/3)/B = 2.0433
% m6=log(8/5)/B = 1.8800
% p5=log(3/2)/B = 1.6219
% A4=log(10/7)/B = 1.4267
% p4=log(4/3)/B = 1.1507
% M3=log(5/4)/B = 0.8926
% m3=log(6/5)/B = 0.7293
% M2=log(8/7)/B = 0.5341
% m2=log(12/11)/B = 0.3480
delay=2.4245;
numOfThreads=15;
envelope=power(t,10).*power((t-sec),6); % elementwise product t^10 (t-sec)^6

for k=0:numOfThreads-1
    curve=A*exp(B*(t-k*delay)); % exponential curve of thread k, delayed by k*delay
    % shift the envelope to the right by one delay, padding with zeros
    envelope=[zeros(1,floor(delay*FS)),envelope(1:numOfEntries-floor(delay*FS))];
    out=out+envelope.*sin(curve); % superpose the thread onto the output
end

% play sound
halfTime=floor(numOfEntries/2);
excerpt=out(:,(1)*halfTime:numOfEntries-1); % second half of the signal
soundsc(excerpt,FS); % default rate is 8192. Here makes it FS instead.
Bibliography
[1] Glyn James, Advanced Modern Engineering Mathematics, Prentice Hall, Pearson Education, 2004.
[2] Stephane Mallat, A Wavelet Tour of Signal Processing, Academic Press, New York, 1999.
[3] David C. Lay, Linear Algebra and Its Applications, Addison Wesley, Pearson Education, 2006.
[4] Yu Shimizu, Neuronal response to Shepard’s tones. An auditory fMRI study using multifractal analysis, Brain Research 1186 (2007), 113–123.