DIFFUSION-BASED MUSIC ANALYSIS:
A NON-LINEAR APPROACH FOR VISUALIZATION AND
INTERPRETATION OF THE GEOMETRY OF MUSIC
A DISSERTATION
SUBMITTED TO THE DEPARTMENT OF MUSIC
AND THE COMMITTEE ON GRADUATE STUDIES
OF STANFORD UNIVERSITY
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
Gregory Kennedy Sell
October 2010
Abstract
Diffusion mapping is a non-linear data analysis method based on a model of the data
as states in a random walk. Through this approach, the global structure of the data
is built up from local connectivity rather than pure distance. This diffusion-based
approach is advantageous because, by relying only on local connectivity, it remains
robust and meaningful in high-dimensional spaces, where Euclidean distance does not,
without requiring any assumptions about the structure of the data. Also, the diffusion mapping format
leads directly into meaningful low-dimensional spaces for visualization of the data’s
structure.
In this dissertation, I will examine the effectiveness of diffusion mapping as a tool
for analysis and visualization of music theory and, through these demonstrations,
make an argument for its potential in the field. Diffusion has never been applied to
music at this level before, nor has it been used analytically at a comparable level in
any other field. It will be shown that the approach is not only capable of organiz-
ing and visualizing music, but also, through those organizations and visualizations,
communicating the underlying music theory used in creating the data sets.
First, I will show that notes within a diffusion space plot the fundamental geo-
metrical shape underlying the intervals of diatonic music, using only those intervals
themselves as input to the system. Furthermore, by combining two or three of these
simple intervals, the diffusion space can easily recreate historically significant musical
visualizations. In both of these cases, the diffusion process requires very basic input
and then automatically organizes the notes into a meaningful and insightful visualiza-
tion. This same process can be applied to temporal events, automatically extracting
the geometry complementary to the patterns of meter and hemiolas.
Diffusion geometry can also be used to organize and group music based on musical
characteristics. Specifically, this will be demonstrated in several key-based organiza-
tions. To accomplish this, I will also perform clustering in the diffusion space. As a
part of this process, I will propose a novel metric in diffusion space called the
diffusion time constant τ. Due to the flexibility of diffusion, key-based organization
can be performed from both distributional and functional approaches, and, in the
distributional case, the performance of the Krumhansl-Schmuckler algorithm can be
improved significantly on a commonly used test corpus, Bach’s The Well-Tempered
Clavier, Books 1 and 2.
Musical excerpts will also be visualized as trajectories in the musical space, po-
sitioning the notes of the scale structurally based on the musical relationships in
the score itself. By animating this visual to follow the trajectory through time, a
new musical analysis and experience tool is introduced. Elements such as harmonic
rhythm, harmonic movement, and repetitive sections can easily be perceived in this
space. This visualization, it will also be shown, is largely robust to temporal vari-
ations and musical errors, plotting versions of the same melody in a recognizably
similar structure.
It will also be demonstrated that, though the majority of the work presented uses
exclusively symbolic representations, the same principles and tools can be applied to
audio signals by layering multiple diffusion maps.
Throughout this process, the applicability of machine learning methods in music
analysis to diffusion space will be examined, in the context of key finding and meter
induction in particular.
This work makes novel contributions to both the fields of diffusion mapping and
computational music analysis. In the diffusion domain, this dissertation firstly offers
a comprehensive guide to the diffusion algorithm designed for a reader with only
moderate mathematical background. Additionally, the diffusion time constant and
the subsequent hierarchical clustering in diffusion space are both new extensions to
diffusion mapping, and it will be shown that the metric is meaningful for both musical
analysis and parsing of arbitrary data sets.
In the music domain, this dissertation contributes a completely new analysis
method. By treating music as data points, musical sets can be organized based on
attributes such as key. And the visualization capabilities of diffusion also have a great
deal of potential in music analysis as a means for understanding musical relationships
and for interacting with musical structure in a new way.
Acknowledgments
Over the course of a full undergraduate and graduate career in one school, there have
been far too many friendships and collaborations to fully recount them all here. But,
some have been so remarkable and appreciated that they must be properly recognized
before beginning the work that follows.
Any discussion of my academic career must begin with my advisor, Jonathan
Berger. Unbelievably, Jonathan taught my first class at CCRMA, all the way back
in my sophomore year, the class that convinced me I belonged at CCRMA. Jonathan
has always been an advisor in the fullest sense, willingly extending his duties well
beyond the classroom and research lab into life and its many decisions. Without
Jonathan Berger, my life would be impossibly different, and for his role in helping
me find my way to this milestone, I will always be grateful.
Chris Chafe and Ge Wang, who, along with Jonathan, make up my committee,
also have my deepest gratitude. They were extraordinarily accommodating to the
challenges and complications involved in finishing a PhD from the opposite side of
the country. They offered wonderful support and inspiring insights into my work.
Jonathan Abel and Paul DeMarinis, who served as the additional members of my
defense committee, were also fantastically accommodating to the hectic scheduling
I forced upon them. Having such a wonderfully creative and supportive committee
made this whole process a true pleasure.
Professor Ronald Coifman from Yale University also merits special recognition.
His guidance through the world of diffusion has been both patient and masterful.
His creativity in generating new ideas and approaches, combined with a remarkably
brilliant mathematical mind, is truly rare, and it has been the highest honor
to spend time working with him. Without his help and guidance, this dissertation
simply would not exist.
I would also like to recognize the many others in the CCRMA community and
beyond who have helped me throughout my academic career. Malcolm Slaney, Les
Atlas, and Julius Smith have provided great guidance in both research and life. And
I always enjoyed and will miss the many conversations, both research-related and
diversionary, with my fellow students Kyogu Lee, Gautham Mysore, Ed Berdahl,
David Yeh, Nelson Lee, and the rest of the DSP group. CCRMA has been my home
for many years, and I am sad to leave its welcoming faces and beautiful views.
Of course, I must mention my family. They supported me long before this research
began, and they will be there long after. I can only hope that this work is able to
honor the sacrifices they made for me.
And finally, the biggest thanks of all goes to my wife Tara. She spent every day in
the trenches with me, and her support and encouragement kept me pushing forward.
Without her in my life, I am sure this dissertation would have taken my mind long
ago.
Contents
Abstract iv
Acknowledgments vii
1 Overview 1
1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2.1 Assumption-free Music Analysis . . . . . . . . . . . . . . . . . 2
1.2.2 Computational Music Theory Analysis . . . . . . . . . . . . . 2
1.2.3 Musical Visualization . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Organization of Content . . . . . . . . . . . . . . . . . . . . . . . . . 3
2 Background 5
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2 Key Finding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2.1 Krumhansl-Schmuckler key-finding algorithm . . . . . . . . . 6
2.2.2 Other key-finding algorithms . . . . . . . . . . . . . . . . . . . 12
2.3 Meter Induction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.3.1 Autocorrelation . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.3.2 Other meter induction methods . . . . . . . . . . . . . . . . . 14
2.4 Musical Visualization . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.4.1 Note or Pitch-Class Visualizations . . . . . . . . . . . . . . . . 15
2.4.2 Key Visualizations . . . . . . . . . . . . . . . . . . . . . . . . 18
2.4.3 Other visualizations . . . . . . . . . . . . . . . . . . . . . . . . 21
2.5 Machine Learning in Music Analysis . . . . . . . . . . . . . . . . . . 21
2.5.1 Unsupervised . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.5.2 Supervised . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3 Diffusion Maps 26
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.2 Diffusion Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.2.1 Data sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.2.2 Affinity function . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.2.3 Affinity-derived Markov matrices . . . . . . . . . . . . . . . . 29
3.2.4 Eigenvectors of a Markov matrix . . . . . . . . . . . . . . . . 36
3.2.5 Eigenvalue decay and the meaning of the eigenvectors . . . . . 37
3.2.6 Diffusion maps and diffusion distance . . . . . . . . . . . . . . 39
3.2.7 Diffusion time constant . . . . . . . . . . . . . . . . . . . . . . 42
3.2.8 Hierarchical clustering with the diffusion time constant . . . . 47
3.2.9 Comparison to other methods . . . . . . . . . . . . . . . . . . 48
3.3 Applying diffusion distance to music analysis . . . . . . . . . . . . . . 54
4 Diffusion-based Music Theory Analysis 57
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.2 Tonal Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.2.1 Input Representation . . . . . . . . . . . . . . . . . . . . . . . 58
4.2.2 Affinity Function . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.2.3 Geometric representations of pitch-class intervals . . . . . . . 59
4.2.4 Recreating note-based visualizations . . . . . . . . . . . . . . 72
4.3 Metrical Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
4.3.1 Metric geometry . . . . . . . . . . . . . . . . . . . . . . . . . 84
4.3.2 Visualizing hemiolas . . . . . . . . . . . . . . . . . . . . . . . 85
4.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
5 Diffusion-based Musical Applications 88
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
5.2 Key Finding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
5.2.1 Key-Finding Characteristics from the Diffusion Time Constant 89
5.2.2 Functional Code-Based Key Organization . . . . . . . . . . . . 94
5.2.3 Extending the K-S Algorithm with Clustering . . . . . . . . . 99
5.3 Meter Induction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
5.4 Visualization of Trajectories . . . . . . . . . . . . . . . . . . . . . . . 104
5.4.1 Twinkle, Twinkle, Little Star . . . . . . . . . . . . . . . . . . 106
5.4.2 Prelude No. 1 in C major (BWV 846) from The Well-Tempered
Clavier, Book 1 . . . . . . . . . . . . . . . . . . . . . . . . . . 108
5.4.3 Robustness to Performance Noise . . . . . . . . . . . . . . . . 113
5.4.4 Audio Signals . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
5.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
6 Conclusions and Future Work 119
6.1 Summary of Contributions . . . . . . . . . . . . . . . . . . . . . . . . 119
6.1.1 Diffusion Time Constant . . . . . . . . . . . . . . . . . . . . . 119
6.1.2 Assumption-Free Music Analysis . . . . . . . . . . . . . . . . 120
6.1.3 Musical Visualizations . . . . . . . . . . . . . . . . . . . . . . 121
6.2 Future Work and Extensions . . . . . . . . . . . . . . . . . . . . . . . 121
6.2.1 Audio Signal-based Analysis . . . . . . . . . . . . . . . . . . . 121
6.2.2 Improved Visualization Platform . . . . . . . . . . . . . . . . 122
6.2.3 Comparison of Diffusion Spaces . . . . . . . . . . . . . . . . . 123
6.2.4 Implications for Non-Tonal Western or Non-Western Music . . 124
6.2.5 Inverting Diffusion Space to Audio . . . . . . . . . . . . . . . 125
6.2.6 Examination of Less Prominent Dimensions of Map . . . . . . 126
6.2.7 Dual Diffusion . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
6.3 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
Bibliography 129
List of Tables
4.1 The pitch-class intervals and their inversions . . . . . . . . . . . . . . 60
5.1 Accuracy for various interval-based key-finding approaches using near-
est neighbors in the diffusion time constant. . . . . . . . . . . . . . . 95
5.2 Accuracy for the K-S key-finding algorithm before and after process-
ing the data with a filter derived from hierarchical clustering in the
diffusion space. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
5.3 Accuracy for the meter induction task using nearest neighbors with
both Euclidean distance and the diffusion time constant. . . . . . . . 102
List of Figures
2.1 The Krumhansl and Kessler key profiles for major (top) and minor
(bottom) keys, from [76] . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2 The Tonnetz, harmonic network, or table of tonal relatives (from [29]),
including the visualizations of parallel (P), leading-tone (L), and rela-
tive (R) triad relationships from Neo-Riemannian theory. . . . . . . . 16
2.3 Several geometric representations proposed by Shepard, from [45]. . . 17
2.4 The Spiral Array, from [13] . . . . . . . . . . . . . . . . . . . . . . . . 18
2.5 Schoenberg’s spatial mapping of keys (from [45]), with the key region
for C major marked . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.6 The MDS derived mapping of the keys from Krumhansl and Kessler’s
key-profiles, and its 2D mapping from the angles of the circles, from [46]. 19
2.7 Two harmonic visualizations of Mozart’s Sonatina No. 1 in C major
K439B (Viennese), 1st Movement, from [62]. . . . . . . . . . . . . . . 20
3.1 Two example data sets that will be used to demonstrate the process
of calculating diffusion distance. In both cases, color indicates the
distribution from which a sample was drawn . . . . . . . . . . . . . . 28
3.2 The probabilities of a random walk concluding in each data point (with
high probability shown in lighter color) at different time scales for the
cluster-based data set. The columns show the case for three different
starting points, shown as a red dot in each. . . . . . . . . . . . . . . . 31
3.3 The probabilities of a random walk concluding in each data point (with
high probability shown in lighter color) at different time scales for the
circle-based data set. The columns show the case for three different
starting points, shown as a red dot in each. . . . . . . . . . . . . . . . 32
3.4 The probability matrix P t for the cluster data set (Fig. 3.1(a)) at
several values of t. Axis labels correspond to the cluster numbering
from Fig. 3.1(a). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.5 Diffusion distance matrices Dt for different values of t for the cluster
data set, where dark color means a short distance. . . . . . . . . . . . 41
3.6 Diffusion distance matrices Dt for different values of t for the circle
data set, where dark color means a short distance. . . . . . . . . . . . 42
3.7 The circle data set plotted in φ1, φ2, and φ3. . . . . . . . . . . . . . . 43
3.8 The diffusion time constant matrices for the example data sets, demon-
strating that data points within structural elements have small time
constants. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.9 The Euclidean distances for the circle data set, where the data points
within the same circle are not close together. . . . . . . . . . . . . . . 47
3.10 Hierarchical trees for both data sets from the diffusion time constants,
showing that structural elements at multiple levels are accurately ex-
tracted. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.1 The pitch classes connected by semitone intervals plotted in the first
two dimensions of the diffusion map. . . . . . . . . . . . . . . . . . . 61
4.2 The pitch classes connected by major 2nd intervals plotted in the first
three dimensions of the diffusion map. . . . . . . . . . . . . . . . . . 62
4.3 The pitch classes connected by major 2nd intervals plotted in dimen-
sions 2, 3, and 4 of the diffusion map. . . . . . . . . . . . . . . . . . . 63
4.4 The pitch classes connected by minor 3rd intervals plotted in the first
three dimensions of the diffusion map. . . . . . . . . . . . . . . . . . 64
4.5 The pitch classes connected by minor 3rd intervals plotted in dimen-
sions 3, 4, and 5 of the diffusion map. . . . . . . . . . . . . . . . . . . 65
4.6 The pitch classes connected by minor 3rd intervals plotted in dimen-
sions 3, 4, and 5 of the diffusion map, viewed from a different angle
than Fig. 4.5. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.7 The pitch classes connected by major 3rd intervals plotted in the first
three dimensions of the diffusion map. . . . . . . . . . . . . . . . . . 67
4.8 The pitch classes connected by major 3rd intervals plotted in dimen-
sions 4, 5, and 6 of the diffusion map. . . . . . . . . . . . . . . . . . . 68
4.9 The pitch classes connected by perfect 4th intervals plotted in the first
two dimensions of the diffusion map. . . . . . . . . . . . . . . . . . . 69
4.10 The pitch classes connected by tritone intervals plotted in the first
three dimensions of the diffusion map. . . . . . . . . . . . . . . . . . 70
4.11 Several geometric representations of intervals appear in the diffusion
space created with the major chord. . . . . . . . . . . . . . . . . . . . 71
4.12 Shepard’s chromatic helix in diffusion space, resulting from the com-
bination of minor 2nd and octave intervals with the full note set, with
and without the minor 2nd intervals drawn in. . . . . . . . . . . . . . 73
4.13 Zooming in on two octaves of the chromatic helix from Fig. 4.12(b). . 74
4.14 Shepard’s double helix in diffusion space, resulting from the combina-
tion of perfect 5th and octave intervals with the full note set, with and
without the major 2nd intervals drawn in. . . . . . . . . . . . . . . . . 76
4.15 Zooming in on two octaves of the double helix from Fig. 4.14(b). . . . 77
4.16 Several other interpretations of the note organization in which Shep-
ard’s double helix exists. . . . . . . . . . . . . . . . . . . . . . . . . . 79
4.17 The diffusion space created with minor 2nd and two octave plus major
3rd intervals with an approximation of the Spiral Array represented by
the lines. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
4.18 The Krumhansl-Kessler key space from Fig. 2.6(a) remade with diffu-
sion. Dots represent major keys and circles represent minor keys. . . 81
4.19 Duple-meter beat trains separated completely from triple meter beat
trains and organized into a square. . . . . . . . . . . . . . . . . . . . 82
4.20 Triple-meter beat trains separated completely from duple meter beat
trains and organized into a triangle. . . . . . . . . . . . . . . . . . . . 83
4.21 Hemiolas based in units of 2 shaped into a square, similarly to the
metric case shown in Fig. 4.19. . . . . . . . . . . . . . . . . . . . . . 86
4.22 Hemiolas based in units of 3 shaped into a triangle, similarly to the
metric case shown in Fig. 4.20. . . . . . . . . . . . . . . . . . . . . . 86
5.1 Diffusion time constants between the pitch classes for major and minor
keys. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
5.2 Diffusion time constants between notes separated by various intervals
for the major subset. . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
5.3 Diffusion time constants between notes separated by various intervals
for the minor subset. . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
5.4 Key profiles derived from the diffusion time constant compared to the
K-K key profiles for major (top) and minor (bottom) keys. . . . . . . 93
5.5 Confusions for all 6 functional key-finding experiments with notewor-
thy confusions labeled. . . . . . . . . . . . . . . . . . . . . . . . . . . 96
5.6 Diffusion maps for all 6 functional key-finding experiments. . . . . . . 97
5.7 The hierarchical tree created from pitch-class distributions labeled with
key with errors circled in red. . . . . . . . . . . . . . . . . . . . . . . 100
5.8 The first three dimensions of the diffusion map for meter classification
on the Essen folksong database colored by meter label. . . . . . . . . 103
5.9 The same as Fig. 5.8 with the test data indicated by larger size, and
errors in labeling of the test data shown largest. . . . . . . . . . . . . . 104
5.10 The melody of Twinkle, Twinkle, Little Star . . . . . . . . . . . . . . 106
5.11 The trajectory for Twinkle, Twinkle, Little Star with the individual
notes marked. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
5.12 Score of Bach’s Prelude No. 1 in C major (BWV 846) from The Well-
Tempered Clavier, Book 1. . . . . . . . . . . . . . . . . . . . . . . . . 109
5.13 The trajectory for Bach’s Prelude No. 1 in C major (BWV 846) from
The Well-Tempered Clavier, Book 1. . . . . . . . . . . . . . . . . . . 110
5.14 The trajectory for Bach’s Prelude No. 1 in C major (BWV 846) from
The Well-Tempered Clavier, Book 1, with only the first four measures
shown. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
5.15 Several trajectories for Bach’s Prelude No. 1 in C major (BWV
846) from The Well-Tempered Clavier, Book 1, with different levels of
performance-like noise synthetically added . . . . . . . . . . . . . . . 114
5.16 The trajectory for Twinkle, Twinkle, Little Star, derived here from an
audio signal instead of symbolic music (as was the case in Fig. 5.11). 116
Chapter 1
Overview
1.1 Introduction
This dissertation presents and develops a non-linear data analysis method called dif-
fusion mapping and examines its applicability to music analysis. Diffusion analysis
has been used for dimensional reduction, extraction of structure, classification, orga-
nization, and visualization of various data sets. Each of these assets will be examined
in the context of music analysis.
This chapter serves to contextualize the work that follows by discussing the moti-
vation for such research. The organization of the remaining content of the dissertation
is also briefly summarized.
1.2 Motivation
The central goal of much of computational music research is to teach computers to
listen to music, and to simulate how humans might process and interpret it. This
would not only be an impressive accomplishment and a significant contribution to
perceptual computation, it would also allow for automating many complicated musical
tasks, including musical similarity and recommendation. Significant progress has been
made, but there is still a great deal of work to be done. Furthermore, automatic
processes are not the only application for computational music understanding. It
could provide a valuable tool in computer-assisted musical teaching and analysis, not
to mention interactive search and organization.
The work presented in this dissertation focuses on advancing research in com-
putational ‘understanding’ of music. The work is primarily motivated by the need for
computational improvements in analysis and visualization of basic elements of mu-
sic theory, and the work is especially significant because it is based entirely on an
assumption-free framework.
1.2.1 Assumption-free Music Analysis
One of the most significant aspects of the diffusion-based music analyses to be pre-
sented is that they require no prior assumptions about the incoming musical signals.
This is a valuable characteristic, one that has largely been missing from other
computational music analyses. Typically, musical assumptions are manually hardwired into
the system or estimated from large databases of training music.
However, in the work to follow, brief and individual excerpts of music can be
analyzed without any assumptions at all, beyond the fundamental assumption that
notes are different, and so the musical relationships can be extracted on as fine a
level as is desired for the application or task. This sort of system has not been tested
or examined before, and so creating that sort of contribution was one of the main
motivations behind the work.
1.2.2 Computational Music Theory Analysis
One key aspect of the perceptual musical listening experience is the concept of music
theory, the set of rules and guidelines that determine the harmonic and temporal
relationships in music. Musical expectation, in which a listener anticipates the pro-
gression of music, and its fulfillment or violation, are all products of music theory,
and most listeners have an underlying concept of music theory, regardless of formal
training or experience.
However, little work has been done in computationally extracting and analyzing
music theory on a fundamental level. Most research has focused on higher-level music
theory concepts like key, meter, genre, and mood, because those have more direct
applications in classification. However, all of these concepts are built on more basic
and atomic concepts such as pitch interval, and little work has been directed at this
level. Building a computational understanding of music theory from the ground up
requires starting from these atomic units, and demonstrating a process that fills in
this groundwork is another core motivation of this work.
1.2.3 Musical Visualization
Musical visualization is a particularly interesting research area. Its wide range of
implementations and applications seems limited only by the imagination. Associating
visual projections with musical signals can provide, at the very least, a dynamic
multimedia experience. Ideally, visualizations offer a new way to analyze and parse
the music, and through this new approach provide a pathway to deeper understanding.
As will be discussed, diffusion mapping is ideal for creating low-dimensional spaces
for visualization of data or, in this case, music. So, at the core of the motivation for
this work is the fact that, while creating a system for assumption-free computational
analysis of music theory, an approach for automatic visualization is developed as well.
Visualizations will be created toward many goals, ranging from geometric represen-
tations of intervals to mapping musical excerpts to trajectories in diffusion space,
and through these demonstrations, the role of visualizations will be explored in the
context of a new and automatic visualization tool.
1.3 Organization of Content
Following this introduction, this dissertation will begin with a thorough review of
previous work in several relevant fields to this work. Key finding, meter induction,
visualization, and machine learning will all be addressed.
This is followed by a thorough and complete introduction to diffusion theory.
While all of the work in this dissertation is geared toward musical applications, the
descriptions of diffusion analysis are universally applicable. It is hoped that this
chapter will serve as a reference for others with new ideas for applications of diffusion
mapping beyond those approaches presented here.
We will then dive into the world of musical applications of diffusion mapping.
First, the experiments will focus on the most basic aspects of music theory, examining
the geometric visualizations of intervals and short rhythmic patterns. In this context,
it is also shown that many past visualizations of musical space are specific examples
of the type of analysis performed in this diffusion space.
The next level of theoretical analysis, focusing on key and meter, is analyzed in
the context of both historical methods and also assumption-free code-based methods.
At this level, a process for creating musical trajectories in a diffusion space based on
the musical relationships is introduced and analyzed for a few examples.
Finally, accompanying the conclusion, a series of directions for future work is
proposed. Because the work presented in this dissertation primarily deals with new
research fields like ground-up computational music theory analysis and assumption-
free theory extraction as well as highly interpretable and customizable fields like
musical visualization, there is a great deal of future work to be considered.
Chapter 2
Background
2.1 Introduction
This work introduces a new analytic system to the world of computational music
analysis, and so it necessarily involves many fields. What follows is a review of research
in key finding, meter induction, musical visualization, and machine learning in music
analysis. The potential for diffusion geometry in computational music analysis and
music information retrieval extends far beyond these applications, but these are the
areas on which the work presented here will focus. While an attempt was made at
broad coverage within these fields, this should by no means be considered a full
review of computational music analysis, music information retrieval, or music theory.
2.2 Key Finding
The key of a musical piece refers specifically to which note is the tonal center, but it
also establishes roles for all 12 of the pitch classes. A key can be in major or minor
mode, each of which defines a different set of expectations for the tonal or harmonic
progressions and cadences. While determining the key is often intuitive for a human
listener, creating a similarly effective computational model has not yet been fully
achieved.
In the field of key finding for symbolic music, the largest shadow is undoubtedly
cast by the Krumhansl-Schmuckler (K-S) algorithm. Not only is the approach used
extensively throughout the field, but many subsequent algorithms can easily be seen
as extensions of the original principles, populating a field called distributional key
finding. Others have proposed many unique algorithms, largely called structural key-
finding algorithms, but the K-S algorithm and the continuations of that work are the
most prominent in the field.
2.2.1 Krumhansl-Schmuckler key-finding algorithm
Initially, Krumhansl and Kessler [46] derived a set of key profiles (also sometimes
called tonal hierarchies, or K-K profiles), seen in Fig. 2.1, from the results of percep-
tual tests. In the tests, listeners were asked to rate how well a certain tone follows
a musical sequence, such as a scale, chord, or cadence. Essentially, the study aimed
to find how well notes of the scale perceptually fit in with musical elements designed
to establish a key. By averaging the results across users and transposing results to
a common key (under the reasonable assumption that the study was unaffected by
transposition), they derived a major key profile and a minor key profile.
The K-S key-finding algorithm [43] correlates these key profiles with an input vec-
tor created from the total duration time of each of the 12 pitch classes. By correlating
with each of the 12 possible shifted orientations of the key profiles, the key of the
excerpt can be estimated as the one with the highest correlation.
\[
\text{Key Estimate} = \operatorname*{argmax}_{n \in \{0,1,\dots,11\}}
\frac{\sum_m \bigl(k_n(m) - \bar{k}_n\bigr)\bigl(x(m) - \bar{x}\bigr)}
     {\sqrt{\sum_m \bigl(k_n(m) - \bar{k}_n\bigr)^2 \,\sum_m \bigl(x(m) - \bar{x}\bigr)^2}}
\tag{2.1}
\]
In this equation kn is a key profile shifted by n, and so kn(m) is the score for the
mth pitch class in the key profile; k̄n is the mean of the key profile. x is the input
vector, and so, for a given musical excerpt, x(m) quantifies the amount of time that
the mth pitch class is playing; x̄ is the mean of the input vector. Note that the correlation
is calculated for both major and minor key profiles.
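As a concrete illustration, Eq. (2.1) can be sketched in a few lines of Python. The profile values are the published Krumhansl-Kessler data (Krumhansl, 1990); the function names and the shifted-profile construction are illustrative choices of this sketch, not Krumhansl and Schmuckler's implementation.

```python
# Krumhansl-Kessler key profiles (index 0 = tonic), from Krumhansl (1990).
MAJOR = [6.35, 2.23, 3.48, 2.33, 4.38, 4.09, 2.52, 5.19, 2.39, 3.66, 2.29, 2.88]
MINOR = [6.33, 2.68, 3.52, 5.38, 2.60, 3.53, 2.54, 4.75, 3.98, 2.69, 3.34, 3.17]

def correlation(k, x):
    """Pearson correlation between a key profile k and input vector x (Eq. 2.1)."""
    n = len(k)
    mk, mx = sum(k) / n, sum(x) / n
    num = sum((k[i] - mk) * (x[i] - mx) for i in range(n))
    den = (sum((k[i] - mk) ** 2 for i in range(n))
           * sum((x[i] - mx) ** 2 for i in range(n))) ** 0.5
    return num / den

def estimate_key(x):
    """Return (tonic pitch class, mode) of the best-correlating of the 24 keys."""
    best = None
    for n in range(12):
        for mode, profile in (("major", MAJOR), ("minor", MINOR)):
            # Shift the profile so that pitch class n is the tonic.
            k = [profile[(m - n) % 12] for m in range(12)]
            r = correlation(k, x)
            if best is None or r > best[0]:
                best = (r, n, mode)
    return best[1], best[2]
```

Given a 12-element duration vector x, estimate_key(x) returns, for example, (7, "major") when the pitch-class durations best match the G-major profile.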
Figure 2.1: The Krumhansl and Kessler key profiles for major (top) and minor (bottom) keys, from [76]
This equation is a standard correlation, with the mean subtracted and the variance
normalized so that the major and minor key scores can be compared. Temperley [76]
suggested eliminating the denominator from this equation for easier calculation,
though with the Krumhansl-Kessler key profiles, this invalidates comparisons between
major and minor correlation scores¹.
The K-S algorithm works quite well, and has become one of the most widely
used key-finding algorithms. It is easy to implement and runs sufficiently fast for
most applications. It also requires very little music theory to understand and is
conceptually straightforward. On a high level, it makes sense that certain notes will
be played more than others in a given key. The fact that this intuition is confirmed
with Krumhansl and Kessler’s perceptual data further validates the approach.
However, there are also several well-known and well-founded concerns with the
K-S algorithm. Researchers have questioned the effectiveness of ignoring timing in-
formation as well as the methods themselves in the perceptual test that derived the
key profiles.
Concerns regarding the exclusion of local timing information
The primary concern with the K-S algorithm is that it is a distributional algorithm,
meaning the input vector ignores all timing and sequencing information, using only
the frequency of a pitch class to determine the key. Many have argued that such a
purely statistical metric is not sufficient for key identification, suggesting that
ordering, location, grouping, and intervals, among others, are also relevant traits (often called structural or
functional attributes [24, 9, 25]). It is relatively simple to generate example melodies
that share their pitch distribution but induce different keys (such as G-E-D-C
implying C major while E-C-D-G indicates G major), demonstrating that other
information is needed in these cases. In a study by Brown et al. [7], it is demonstrated
that the presence of a sequenced tritone (a so-called rare interval) influences
a listener's key decision.
Several studies have examined the significance of this limitation. West and Fryer
[88] asked subjects to rate a probe tone as a tonic after the notes of a scale are played
¹ Temperley addresses this by proposing other key profiles, as will be discussed later.
in a random order. Since the results do not show consensus for either musicians
or non-musicians, it is concluded that sequencing information is necessary for key
estimation. In [55], Matsunaga and Abe present musicians and non-musicians with
similar randomly ordered pitch sequences, this time generated from the sequence
{C,D,E,G,A,B}, and ask them to identify the key. This time, the results gener-
ally suggest that the pitch set is an important characteristic in key identification,
as listeners were able to agree on the key of the sequence in many cases. But, in
other cases, significant disagreement among the listeners indicates that pitch-class
distribution alone is not sufficient for defining a key. Interestingly, in both of these
studies, musicians and non-musicians responded similarly, suggesting that key per-
ception does not require or improve significantly with musical training. Two other
studies [72, 79] draw the randomly ordered sequences from statistical distributions,
and in both cases the listeners agree on the key in a reasonable number of cases. In
general, the perceptual tests confirm that pitch distributions are useful in identifying
the key, though they are not always completely sufficient.
Unfortunately, the loss of local timing information is unavoidable with distribu-
tional key-finding algorithms (in fact, it is fundamental to them). However, in [79],
Temperley and Marvin observe that most of the examples used in the above stud-
ies are relatively short (sometimes only a few notes long) and simple. So, somewhat
counterintuitively, the lack of local timing may be less of a concern for longer excerpts,
when more samples (notes) give a theoretically more accurate pitch distribution.
Huron and Parncutt [36] also proposed an extension of the K-S algorithm that
tries to account for these timing concerns. Their model includes weightings related to
perceptual salience of the input, and also to the decay over time (a memory effect).
By incorporating these features, they are able to improve the model’s performance
on several examples from Krumhansl and Kessler’s original report that were shown
to have time dependence.
Concerns regarding the exclusion of global timing information
A related concern to the lack of local timing information is the lack of global timing.
Distributional key-finding algorithms must assume that the key is stationary over
the entire excerpt used for the estimate. This makes the inclusion of a modulation
extremely difficult to handle. When a full score is collapsed down to 12 duration
measurements and given only one key estimate, it is simply not possible to find and
label modulations, should that be desired. Furthermore, a modulation will corrupt the
one key estimate that is given. This is because, if there is a modulation somewhere
in the score, then the notes will be drawn from two key distributions, that of the
original key and that of the modulated key. As a result, the pitch distribution will
be a hybrid of the two, and will not fit as well into a classification of either key.
This effect is not always undesirable. The hybridization of the input vector as a
result of content from multiple keys can be used as a tool for understanding the music
itself [78]. Also, a study showed that the strength of a melody’s key match correlates
to a perceptual judgment of that melody’s tonal structure [75].
However, if finding modulations is desired, these issues can also easily be solved
by limiting the duration of the musical excerpt used. The first example of this was
proposed by Krumhansl [43], in which a separate key decision is made for each measure
of the score by combining the current measure’s input vector with weightings of the
neighboring measures.
This idea was also extended by Shmulevich and Yli-Harja [70], in which a fully
sliding window is used. The output can then be filtered with either a median filter or
a graph-based smoothing, resulting in a stepwise set of key judgments. An advantage
to this approach is that the very definition of a modulation (a full change in key,
as opposed, for example, to tonicization, in which harmonic movement only briefly
moves towards a secondary key) can be controlled by changing the parameters of the
smoother. The authors also show that this approach can be used for musical pattern
recognition.
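The sliding-window idea can be sketched as follows. The per-window estimator is pluggable (for example, a K-S correlation), and a simple median filter stands in for the paper's smoothing stage; all names are illustrative, and median-filtering key indices presumes an ordering of keys, which is itself a simplification of the published method.

```python
def median_filter(keys, width=3):
    """Median-smooth a sequence of key indices; the ends are left unchanged."""
    half = width // 2
    out = list(keys)
    for i in range(half, len(keys) - half):
        out[i] = sorted(keys[i - half:i + half + 1])[half]
    return out

def track_keys(windows, estimate_key, width=3):
    """One key estimate per sliding window, smoothed into stepwise judgments."""
    return median_filter([estimate_key(w) for w in windows], width)
```

Widening the filter demands that a key persist longer before it is reported, which is one way of drawing the line between a full modulation and a brief tonicization.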
Temperley [76, 77] proposed an approach for handling modulations by assigning
a penalty to a key change. In this system, the optimal key judgment is then deter-
mined by some combination of how well a key fits the pitch distribution with the
penalty assigned to each key. This approach naturally leads to a Bayesian framing
of the algorithm, in which the key correlation scores and key-change penalties are
represented with probabilities. Additionally, a variation on the input vector was pro-
posed, in which, within a short window, only a binary present/absent metric is kept
for each of the 12 pitch classes, preventing the algorithm from overvaluing anomalous
pitch repetitions (though this obviously also increases sensitivity to notes that appear
sparsely).
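The binary variant of the input vector is simple to construct; the sketch below assumes MIDI note numbers as input and is an illustration, not Temperley's code.

```python
def flat_input_vector(window_notes):
    """Binary present/absent vector over the 12 pitch classes for one window.

    window_notes: iterable of MIDI note numbers sounding in the window.
    Repetitions of a pitch class have no extra effect, so an anomalously
    repeated note cannot dominate the estimate.
    """
    x = [0] * 12
    for note in window_notes:
        x[note % 12] = 1
    return x
```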
Concerns regarding the probe-tone method
Another concern about the K-S algorithm involves the perceptual test itself. It has
been suggested that the probe-tone method is biased by the timing of the test tone,
causing subjects to judge how the tone fits at the final position of the sequence
rather than simply how it fits with the sequence. Aarden [2] tested this hypothesis,
with results suggesting that listeners are indeed biased to hear the probe tone as the
phrase-final note. One common extension to address this issue is to replace the key
profiles derived by Krumhansl and Kessler.
Drawing from the fundamental assumption of distributional key-finding that a key
is determined statistically, Aarden derived new key profiles learned from the statistics
of a musical corpus. Tests showed that these new profiles performed better than
the original K-K profiles. Temperley and Marvin [79] also used profiles statistically
derived from a musical corpus to create the musical sequences used in their perceptual
study.
Temperley [76] tried a modification on the key profiles in which entries are rounded
and simplified, leading to profiles with most scores shared between the major and
minor keys, with only the 3rd, 6th, and 7th degrees differing, as is the case in the scales.
The major and minor profiles also have equal norms (this allows for the elimination
of the denominator in the correlation in Eq. (2.1), as previously discussed).
Hu and Saul [35] learned key profiles from a musical corpus using Latent Dirichlet
Allocation (LDA). In this context, key itself was never explicitly defined; rather, musical scores
were analyzed and organized into 24 categories, or topics, and the key profiles were
derived as the mechanism for assigning these topics.
It is worth noting, though, that in all of these cases, the new key profiles share
many of their characteristics with the original K-K key profiles.
2.2.2 Other key-finding algorithms
Most key-finding algorithms outside the K-S family are referred to as structural or
functional algorithms, so called because they incorporate timing and ordering
information.
One early example is a rule-based system proposed by Longuet-Higgins and Steedman
[52, 51]. In this approach, the notes of a melody are examined one at a time.
Starting from the beginning, each note eliminates from consideration all keys in which
the note is absent. This process is continued, moving along the melody, until only
one key remains. If multiple keys remain at the end of the melody, then the key is
selected among the remaining candidates based on the first note in the melody. The
theory behind this approach is based on the observation that the key-tones for a given
key occupy a compact space in the Tonnetz, seen in Fig. 2.2.
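The elimination idea can be sketched as follows; this version handles only the 12 major keys and uses a crude tie-break, so it is a simplification of the published algorithm, and all names are mine.

```python
# Sketch of the elimination idea behind the Longuet-Higgins/Steedman
# algorithm, simplified to the 12 major keys (the original also covers minor
# keys and has more refined rules, e.g. for notes that would eliminate every
# remaining candidate). Pitches are pitch classes 0-11, with 0 = C.
MAJOR_SCALE = {0, 2, 4, 5, 7, 9, 11}  # scale degrees in semitones above the tonic

def eliminate_key(melody_pcs):
    candidates = set(range(12))          # tonics of the 12 major keys
    for pc in melody_pcs:
        remaining = {t for t in candidates if (pc - t) % 12 in MAJOR_SCALE}
        if len(remaining) == 1:
            return remaining.pop()       # a unique key has been found
        if remaining:                    # ignore fully-eliminating notes (simplification)
            candidates = remaining
    # Tie-break among survivors: prefer the key whose tonic is the first note.
    first = melody_pcs[0]
    return first if first in candidates else min(candidates)
```

For the melody G-B-C-F (pitch classes 7, 11, 0, 5), the first three notes leave C major and G major as candidates, and the F then uniquely selects C major.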
The rare-interval approach to key-finding [9, 7] is another structural algorithm
that seeks out sequential intervals that are seen as specific to certain keys. The
most prominent example is the tritone, which is largely unique to a single key, such
as between the 4th and 7th degrees of a major scale. It is suggested that the most
perceptually likely method for key identification is to determine a tonal center based
on evidence of these rarely occurring intervals.
Rizo et al. [61] proposed a key-finding algorithm derived from a tree representation
of melodies [60]. The tree structure is built so that time is represented
from left to right and duration of a note is represented by tree level. By iteratively
assigning a key to each leaf in the tree and moving up the representation, one key
estimate is left in the end.
The Center of Effect Generator (CEG) is a distributional key-finding algorithm
(like the K-S algorithm) proposed by Chew [13, 14], based on her geometric
representation, the Spiral Array (seen in Fig. 2.4). Each combination of notes is represented
by a single point, called the center of effect (c.e.), and the key is defined based on
the distance between the excerpt’s c.e. and the c.e. calculated for a key. Like
Longuet-Higgins and Steedman’s algorithm, the CEG is based on the observation
that key-based tonal notes form compact spaces and shapes in the Spiral Array. In
fact, the Spiral Array is based on the Tonnetz, so it is not surprising that key-finding
algorithms in each space are grounded in similar properties.
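A simplified sketch of the c.e. idea follows. All constants and names are illustrative: the true Spiral Array calibrates the radius and rise, spells enharmonics separately along an unbounded line of fifths (here folded mod 12, a sketch artifact), and defines key points hierarchically from tonic, dominant, and subdominant chords, whereas this sketch represents each major key by its tonic triad's c.e. alone.

```python
import math

# Pitch classes sit on a helix that rises by perfect fifths, a quarter turn
# per fifth; the "center of effect" (c.e.) of an excerpt is the
# duration-weighted average of its pitch positions.
R, H = 1.0, 0.4  # helix radius and rise per fifth (illustrative values)

def position(pc):
    k = (7 * pc) % 12          # steps along the line of fifths from C
    angle = k * math.pi / 2    # a quarter turn per fifth
    return (R * math.sin(angle), R * math.cos(angle), k * H)

def center_of_effect(notes):
    """notes: list of (pitch_class, duration) pairs."""
    total = sum(d for _, d in notes)
    return tuple(sum(position(pc)[i] * d for pc, d in notes) / total
                 for i in range(3))

def ceg_major_key(notes):
    """Major key (tonic pitch class) whose triad c.e. is nearest the excerpt's c.e."""
    ce = center_of_effect(notes)
    triad = lambda t: [(t, 1), ((t + 4) % 12, 1), ((t + 7) % 12, 1)]
    return min(range(12),
               key=lambda t: math.dist(ce, center_of_effect(triad(t))))
```

Because every duration-weighted combination of notes collapses to a single point, comparing an excerpt to a key reduces to a distance computation in three dimensions.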
2.3 Meter Induction
Meter generally refers to the organization of the rhythm and beats in music. It
consists of a periodic system of emphasized down beats, and is generally broken into
segments based on either 2 (duple meter) or 3 (triple meter) beats, or one of several
common multiples. As with key, determining the meter is usually natural and
intuitive for a human listener, but, though great strides have been made,
computational models have not yet achieved comparable results. The meter of a musical
piece can be valuable information on its own, and is also important for tasks like
musical transcription [66] and editing [11].
The field of meter induction is also not quite as widely studied as key finding,
but some past work has aimed at computationally extracting meter. Beat tracking
and tempo estimation from real audio are related rhythmic tasks, but they are
different from meter induction and therefore will not be dealt with here (though a
few systems have tried to incorporate them all together on some level [31, 42]). The
diffusion-based organization to follow is based on the autocorrelation approach for
meter induction, but several other methods will be discussed as well.
Most meter induction methods are based on symbolic data, but can also be applied
to musical signals, typically after extracting onset information.
2.3.1 Autocorrelation
The concept of applying autocorrelation to meter induction was first proposed by
Brown [8]. Using an autocorrelation function to analyze the rhythmic timing of events
is designed to capitalize on the periodic structure of meter. Presumably, the music
has repetitions of rhythmic sequences to create the sense of meter, and these rhythmic
sequences occur at similar places within the metrical structure. The autocorrelation
function will show spikes at the timing difference between the repeated events, and
that timing difference is a clue toward the meter.
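A minimal sketch of this approach, assuming the score has been quantized to an onset train at some fixed tick resolution (the names and the unnormalized autocorrelation are illustrative choices):

```python
def autocorrelation(x, max_lag):
    """Unnormalized autocorrelation of an onset strength train x."""
    return [sum(x[i] * x[i + lag] for i in range(len(x) - lag))
            for lag in range(max_lag + 1)]

def dominant_period(onsets, max_lag):
    """Nonzero lag with the strongest autocorrelation: a candidate beat period."""
    ac = autocorrelation(onsets, max_lag)
    return max(range(1, max_lag + 1), key=lambda lag: ac[lag])
```

For an onset train with a note every three ticks, the autocorrelation peaks at lag 3 and its multiples, pointing toward a triple grouping at that level.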
Autocorrelation has also been applied to isochronous melodies using pitch infor-
mation [86]. In addition to being a relevant example of autocorrelation, this is also
one of the few times that pitch information has been incorporated into meter induc-
tion, even though pitch information has been shown to be perceptually relevant to
meter [32].
Toiviainen and Eerola [81, 82] extended these models by weighting the rhythmic
events based on several criteria, including duration, melodic accent, melodic interval,
and melodic trajectory. Results suggested that autocorrelation functions based on
only the onset timing with no accent weighting at all yielded the most accurate meter
classification.
2.3.2 Other meter induction methods
Longuet-Higgins and Steedman [52] proposed a system for meter induction very simi-
lar to their key-finding algorithm. In the meter version, rhythmic events are analyzed
one at a time starting from the beginning, eliminating meters that are unlikely to
see that event with each new input. This rule-based system is the metrical dual of
the key-finding version: where that algorithm analyzes pitch classes to eliminate
keys, this one uses rhythmic timing to eliminate meters.
The Generative Theory of Tonal Music (GTTM) [49] is an important example of
computational music analysis with strong ties to meter. Fundamentally, it generates
structure and understanding of a musical piece with a predictive model based on a
set of analysis rules. In the context of meter induction, GTTM strictly ties meter to
tonal harmony by organizing music into a metrical structure derived from multiple
levels of meter and rhythm. This approach aims to extract and explain the sense
of strong and weak beats that music creates for a listener, a key aspect of meter.
Based on this accent structure, the predictive model and its set of rules
determine the meter that a listener is likely to experience for a given musical excerpt.
Large and Kolen [47] proposed a system for determining meter based on the
oscillations of resonators. In it, a bank of oscillators organizes in response to a rhythmic input,
and the resulting organization helps determine the meter. This process is suggested
to model human perception of meter.
Parncutt [59] suggested another model for the perception of musical meter. In-
corporating several aspects of GTTM, the model estimates the accent structure of a
rhythmic input through a multi-stage process in which different types of accents are
separately estimated. The model then estimates both the meter and the expressive
timing of the input. The model was designed with reference to several perceptual
tests, and its behavior matches the experimental results for those same tests.
Dixon [26] suggested using multiple metrical hypotheses, determining the meter
as the hypothesis that best fits with the inter-onset timing data. This specific system
was used on the very difficult task of determining meter for expressive performances,
in which the tempo is not guaranteed to be consistent throughout the piece. Including
a perceptual estimation of the metrical relevance of rhythmic events in the hypothesis
search as a weighting improves the results as well. These perceptual estimations make
use of pitch information in addition to the temporal information.
2.4 Musical Visualization
Visualizing music has long been a tool for analysis and understanding of musical
structure on many levels. From analyzing chords to estimating keys, mapping the
musical data to some corresponding geometric space has been extensively used to
help find a deeper understanding of music theory and its underlying principles.
Generally speaking, visualizations of music theory can mostly be separated into
two categories: those which organize the pitch classes (C, C♯, D, D♯, ...) or notes
(pitch class with octave information), and those which visualize the keys (C major, c
minor, C♯ major, c♯ minor, ...). As a notational matter, uppercase letters indicate a
major key, while lowercase letters indicate a minor key.
2.4.1 Note or Pitch-Class Visualizations
One of the earliest attempts to visualize music theory is the Tonnetz, also known as
the table of tonal relations or the harmonic network, which plots the 12 pitch classes
(a) A region of the traditional 2-D Tonnetz and three contextual inversions (sharing a common edge) with a C-major triad. (b) The axis system of the traditional 2-D Tonnetz (P5 and M3 axes).
Figure 2.2: The Tonnetz, harmonic network, or table of tonal relatives (from [29]), including the visualizations of parallel (P), leading-tone (L), and relative (R) triad relationships from Neo-Riemannian theory.
in a two dimensional space, seen in Fig. 2.2. An early version was first proposed by
Leonhard Euler [27] as a 4 × 3 matrix with notes separated by a fifth on one axis
and by a major third on the other. Later, Arthur von Oettingen [57] expanded on
the concept by transposing the visualization and extending it as an infinitely repeating
pattern. The visualization was more recently revived in music theory by David
Lewin [50] in 1982, who, along with others [37, 15, 16], recognized the relevance
of the mapping to the fundamental triads of Neo-Riemannian theory. This leads
to interesting visualizations of music theory and voice-leading in which the musical
processes are visualized as rotations and transformations of triangles in the Tonnetz.
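These triangle flips have a compact algebraic form: on (root, quality) pairs, each of the P, L, and R operations is an involution that moves the root by a fixed interval. The encoding below is an illustrative sketch of the standard definitions, not drawn from any particular source.

```python
# Neo-Riemannian operations on triads, encoded as (root pitch class, quality).
# The Tonnetz visualizes each as a flip of a triad's triangle across one edge.
def P(root, quality):  # parallel: C major <-> C minor
    return root, ("minor" if quality == "major" else "major")

def R(root, quality):  # relative: C major <-> A minor
    if quality == "major":
        return (root + 9) % 12, "minor"
    return (root + 3) % 12, "major"

def L(root, quality):  # leading-tone exchange: C major <-> E minor
    if quality == "major":
        return (root + 4) % 12, "minor"
    return (root + 8) % 12, "major"
```

Each operation applied twice returns the original triad, which is exactly what the shared-edge flip in the Tonnetz depicts.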
In addition to uses for music theory, the Tonnetz has been used as a tool for key
estimation, based on the observation that the primary chords for a certain key form
compact shapes in a unique region for that key, though in this case the Tonnetz was a
2D projection of a proposed 3D mapping that used the octave as the third dimension
[52, 51]. Other 3D extensions have been used as well [29]. The Tonnetz has also been
animated for visualizing the evolution of consonance throughout a score [6].
However, an inconvenience of the Tonnetz visualization is that, in order to ac-
count for the cyclic aspects of Western music theory, it must repeat infinitely in each
direction. Along with being cumbersome, this requires that each note occupy multiple
Figure 2.3: Several geometric representations proposed by Shepard, from [45].
locations in the projection, a conceptually problematic proposition. For this reason,
many other theory-based visualizations use circular shapes.
One such visualization was proposed by Shepard [69], in which the notes are
organized as a helical structure, increasing chromatically as they ascend the spiral
(Fig. 2.3). The helix is organized so that octaves of the same pitch class are directly
above or below each other. A projection down the octave dimension (the height of the
helix) yields the chromatic circle. The representation was also extended to include
the perfect fifth, leading to a double helix (labeled (c) in Fig. 2.3).
Chew [13, 14] proposed a representation that built off of the previous visualiza-
tions. Her mapping, called the Spiral Array (Fig. 2.4), organizes the notes as an
ascending spiral, increasing by perfect fifths and rotating 360° every four notes. This
creates a vertical axis that relates the major thirds. Chew also observes that this ar-
rangement is essentially a reshaping of the Tonnetz in which the redundancies overlap
by rolling it into a spiral shape. In the Spiral Array, chords occupy unique triangular
spaces. By projecting these triangles onto a representative point, the chords then
create a similar spiral within the note spiral. Then, by similarly defining a key based
on its tonic, dominant, and subdominant chords, a unique triangle for each key can
be projected on its own point, creating yet another spiral within the note and chord
Figure 2.4: The Spiral Array, from [13]
spirals. This creates an interesting space for simultaneous visualization and analysis
of multiple levels of music.
Others have created more mathematically derived spaces to geometrically map
the notes and chords [33, 84, 10]. These spaces differ from the others in that
they are based on musical operations rather than simply on notes or chords. One such
space is, by design, unaffected by octaves, permutations, transpositions, inversions,
and cardinality changes (the so-called OPTIC operations), because it is argued that
the musical listening experience is largely immune to these operations as well.
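The effect of the O, P, and T operations can be sketched as a normal form: two chords are OPT-equivalent exactly when they reduce to the same interval pattern. (Inversion and cardinality equivalence, the I and C of OPTIC, are omitted here, and the function name is mine.)

```python
def opt_class(notes):
    """Normal form of a chord under octave (O), permutation (P), and
    transposition (T) equivalence: the lexicographically smallest tuple of
    intervals above a chosen root, over all rotations of the pitch-class set."""
    pcs = sorted({n % 12 for n in notes})            # O and P: sorted pitch classes
    rotations = [pcs[i:] + pcs[:i] for i in range(len(pcs))]
    # T: measure intervals above each candidate root and keep the smallest.
    return min(tuple((p - rot[0]) % 12 for p in rot) for rot in rotations)
```

Under this reduction, every major triad in any octave, voicing, or transposition maps to the same tuple, while major and minor triads, which are related by inversion rather than transposition, remain distinct.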
2.4.2 Key Visualizations
Schoenberg [67] developed a mapping, seen in Fig. 2.5, that is somewhat similar to the
Tonnetz (Fig. 2.2). However, unlike the Tonnetz, the elements in the visualization are
the 24 keys (rather than the 12 pitch classes). This mapping organizes keys such that
dominants are neighbors along the vertical axis while the horizontal axis places
relative and parallel major/minor keys adjacent to one another. The design is such that the nearest neighbors
for a key are also the most common modulations seen in Western music.
Another mapping of the keys originates from the key-profiles that Krumhansl and
Kessler [46] derived from perceptual data. Using Multidimensional Scaling (MDS)
on the inter-key distances created from the key-profiles, they found a 4D space (Fig.
Figure 2.5: Schoenberg's spatial mapping of keys (from [45]), with the key region for C major marked
(a) 4D space: reproduction of Figure 4 from Krumhansl and Kessler, the four-dimensional MDS solution of the intercorrelations between the 24 major and minor key profiles (the circle of fifths for major and minor keys appears in dimensions 1 and 2; dimensions 3 and 4 group keys separated by a major third, with each major key flanked by its parallel and relative minors).
(b) 2D remapping: reproduction of Figure 5 from the same paper, the equivalent two-dimensional map of the scaling solution, whose identified opposite edges form a torus; the circle of fifths and the parallel and relative relations for the C major key are noted.
Figure 2.6: The MDS derived mapping of the keys from Krumhansl and Kessler’skey-profiles, and its 2D mapping from the angles of the circles, from [46].
Krumhansl and Kessler's MDS analysis of the key-profile correlations produced a four-dimensional solution (Fig. 2.6(a)) in which two dimensions create major and minor circles of fifths (though the
circles are related by a major second instead of the more common relative or parallel
relationship) and the other two dimensions group major or minor keys separated by
a major third. Krumhansl and Kessler observed that these two circular shapes could
be transformed to a single 2D space in which the coordinates are defined by the
angle within each circle that a key occupies, seen in Fig. 2.6(b). Interestingly, this
transformation creates a space almost identical to Schoenberg’s.
Many other key-based visualizations have sought to map the key of a given musical
excerpt, oftentimes utilizing one of the visualizations already discussed for the process.
Mardirossian and Chew [54] animate the process as the music progresses, dynamically
CHAPTER 2. BACKGROUND 20
(a) Linear scale (b) Log scale
Figure 2.7: Two harmonic visualizations of Mozart's Sonatina No. 1 in C major, K439B (Viennese), 1st Movement, from [62].
changing circles’ color (key) and size (confidence in that key estimate) to visualize
the key-based content of the piece. Toiviainen and Krumhansl [83] use contour maps
to visualize the strength of key decisions for an excerpt. Both of these visualizations
show the relative strength of keys, which offers great insight into the tonal content
of an excerpt. Animating the process visualizes the evolution of that tonality over
time, as well. The KeyGram and Key Correlation visualizations in [30] take similar
but slightly different approaches to the previous visualizations.
Sapp [63, 64] proposed a completely unique space for key estimation visualization.
The system hierarchically colors according to the key estimates derived from any key
detection algorithm performed on sliding windows of multiple lengths. Sapp presents
two versions of the visualization: a triangular shape that is linear in window length
(Fig. 2.7(a)), and a rounding of the triangle to give equal resolution to different
time scales (Fig. 2.7(b)). The plots are suggested as a means for comparing key
estimation algorithms. Sapp also observes that the hierarchical nature of the plots
gives a Schenkerian representation of the musical selection, and also complements
Lerdahl and Jackendoff’s hierarchical computational analysis [49].
2.4.3 Other visualizations
Not all visualizations use notes or keys as the basic unit for graphing. One common
goal is to geometrically represent an entire musical excerpt or melody as an object.
Foote and Cooper [28] suggest using a self-similarity matrix to visualize the structure
of music. Bello [5], as a step towards grouping music, demonstrated that performing
MDS on a self-similarity matrix graphs an excerpt as a trajectory through a low-
dimensional space. It should be noted that this approach is similar to the diffusion-
based visualizations to be presented.
Online, there are also several visualizations for the structure of a musical excerpt.
Wattenberg [87] creates a series of arcs and semicircles to show the structural repe-
titions and relationships in a musical piece. Malinowski [53] animates the score with
color and flowing circles to represent the performance of the piece.
In the field of music visualization, there have also been a few particularly innovative
and unique examples. Aarden and Huron [1] created a program for plotting the
geographical origins of certain aspects of music theory onto a map, visualizing the
cultural origins of musical entities. Janata et al [40] show the regions of the brain
activated by certain tonal events. While this research was not intended specifically as
a visualization of the music, the mapping of the neural activity is itself a visualization
of the music, passed through the filter of the perceptual system.
2.5 Machine Learning in Music Analysis
Machine learning has permeated most engineering fields, and computational music
analysis is no exception. In musical signal analysis, machine learning is widely used.
Statistical learning approaches have been proposed for artist identification, instru-
ment identification, note/chord extraction, cover song identification, and music rec-
ommendation, among other tasks.
On the symbolic level, machine learning has played an important role, though not
quite as pervasive as in signal analysis. This is potentially, in part, because the
knowledge needed in symbolic music analysis is oftentimes more easily incorporated
directly than learned from data. Additionally, many of the high-level tasks
in music analysis are based on human perception, which is possible to incorporate
into a machine learning system, but requires large quantities of manually labelled
data, which is always difficult to find. So, instead of using musical data to teach an
algorithm music theory and human judgment, those aspects are sometimes included
in the algorithm itself.
There are several examples where researchers chose to incorporate musical knowl-
edge for a classification task rather than use pure machine learning. In the K-S
key-finding algorithm, Krumhansl and Kessler's perceptually derived key profiles [46]
were incorporated into the algorithm directly. Temperley's synthetic profiles
[76] were also created manually. Chew’s Center of Effect Generator (CEG) key-finding
algorithm [13] also incorporates prior knowledge of music theory, including the chord
composition of a key, rather than learning it from data. In meter classification, it is
common to recognize that, because the vast majority of western music is written in
either duple or triple meter, an autocorrelation feature will have stronger peaks at
lag times that are either multiples of 2 or 3 [8, 82].
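To make the profile-matching idea concrete, here is a minimal sketch of a K-S-style key finder: a 12-bin pitch-class histogram is correlated against 24 rotations of a major and a minor template, and the best-scoring rotation names the key. The template values and the example histogram below are illustrative placeholders, not Krumhansl and Kessler's published ratings.

```python
import numpy as np

# Illustrative major/minor templates (placeholders, NOT the published
# K-K profiles): heavy weight on the tonic, then the fifth, then the third.
MAJOR = np.array([5, 1, 2, 1, 3, 2, 1, 4, 1, 2, 1, 2], dtype=float)
MINOR = np.array([5, 1, 2, 3, 1, 2, 1, 4, 2, 1, 2, 1], dtype=float)
NAMES = ["C", "C#", "D", "Eb", "E", "F", "F#", "G", "Ab", "A", "Bb", "B"]

def find_key(pc_hist):
    """Correlate a 12-bin pitch-class histogram with all 24 rotated
    profiles and return the best-matching key label."""
    best, best_r = None, -np.inf
    for tonic in range(12):
        for name, profile in (("major", MAJOR), ("minor", MINOR)):
            r = np.corrcoef(pc_hist, np.roll(profile, tonic))[0, 1]
            if r > best_r:
                best, best_r = f"{NAMES[tonic]} {name}", r
    return best

# A C-major-like histogram (note counts for the pitch classes C..B).
hist = np.array([10, 0, 6, 0, 7, 5, 0, 9, 0, 5, 0, 4], dtype=float)
print(find_key(hist))
```

With real profiles in place of the placeholders, this is the whole mechanism: the musical knowledge lives in the template, not in a trained model.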
Another popular application of machine learning principles is in the field of algo-
rithmic composition [58, 23, 34]. Algorithmic computing in general is a field in which
music is composed by following a set of rules or instructions, often by a computer in
recent years, though a computer is not necessary. In many cases, however, machine
learning principles are included so that the system can learn the rules from a musical
set. A model is created based on the existing music, and then that model can be
used to create new music. Preexisting knowledge can often be incorporated in these
systems as well.
However, there often remains motivation for utilizing pure machine learning in
symbolic music analysis.
2.5.1 Unsupervised
Unsupervised machine learning typically relies on automatic pattern or structure
recognition within the data rather than requiring the patterns to be defined by known
labels. In this regard, it is well suited for musical applications, because most attributes
of music are well structured and labels are difficult to come by for high-level tasks.
One example of unsupervised learning is the key-finding algorithm proposed by
Hu and Saul [35]. In this system, key-profiles were derived in an unsupervised fashion
using Latent Dirichlet Allocation (LDA). Instead of defining the 24 keys, this system
categorizes the musical corpus into 24 distinct labels, defined by their rating from
the unknown key-profiles. With only the assumption that the key-profiles are related
through transposition or shifting, the algorithm properly classifies the key more often
than the K-S algorithm on the test data. This algorithm is also advantageous because
it does not require any labeling at all.
Juhasz [41] designed an unsupervised learning system for extracting motives from
a folk song database by searching for repeated patterns with a self organizing map.
By performing this analysis on the melodies from different regions, the cultures them-
selves can be grouped based on their common musical traits.
2.5.2 Supervised
Some of the most straightforward uses of supervised machine learning occur in key-
finding. In several extensions of the K-S algorithm ([2, 79]), key profiles derived
from statistical analyses of musical corpora replace the K-K profiles. In this approach,
common pitch distributions for major and minor keys are learned from the data, a
clear example of supervised learning.
Several machine learning algorithms were examined in the context of style classi-
fication by Dannenberg et al [21]. In the work, Bayesian classifiers, linear classifiers,
and neural networks were all tested for the task of classifying styles based on 5 seconds
of trumpet music. All approaches performed with reasonably high success.
Tzanetakis et al [85] used pitch distributions (here referred to as folded Pitch
Histograms) to classify music based on genre. Here, k-Nearest Neighbors, a common
supervised learning algorithm, is used to assign genre labels to unknown music based
on the known labels.
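The k-Nearest Neighbors step itself is simple enough to sketch generically; the toy "histograms" and labels below are invented for illustration and are not Tzanetakis's features or data.

```python
import numpy as np

def knn_label(query, train_X, train_y, k=3):
    """Label a query feature vector by majority vote of its k nearest
    training examples under Euclidean distance."""
    dists = np.linalg.norm(train_X - query, axis=1)
    nearest = train_y[np.argsort(dists)[:k]]
    values, counts = np.unique(nearest, return_counts=True)
    return values[np.argmax(counts)]

# Toy 12-bin "pitch histograms" with made-up genre labels.
train_X = np.array([[8, 0, 5, 0, 6, 5, 0, 7, 0, 4, 0, 3],   # strongly diatonic
                    [9, 1, 4, 1, 7, 4, 0, 8, 0, 5, 0, 2],   # strongly diatonic
                    [5, 4, 5, 4, 5, 4, 5, 5, 4, 5, 4, 5],   # nearly chromatic
                    [4, 5, 4, 5, 4, 5, 4, 4, 5, 4, 5, 4]],  # nearly chromatic
                   dtype=float)
train_y = np.array(["classical", "classical", "jazz", "jazz"])

query = np.array([8, 0, 4, 1, 6, 4, 0, 7, 0, 4, 0, 3.0])
print(knn_label(query, train_X, train_y))
```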
Basili et al [4] also applied machine learning to the problem of genre classification.
In this study, many simple musical features such as meter changes and instrument
classes are used as input for numerous machine learning algorithms. In the end, no
algorithm stands out above the others, but it is concluded that even simple musical
features can result in reasonably good classification.
Hidden Markov Models (HMMs) were used by Chai and Vercoe [12] to classify
folk music melodies into their country of origin based on their tonal progression. In
this approach, separate HMMs are built for each region, encoding the different tonal
relationships for each case. It is then determined for an unknown melody which set
of tonal relationships were more likely to have produced that melody. The method
performs well, despite the high-level task it is asked to perform.
A similar approach was used by Mavromatis [56], in which HMMs were modeled
after Greek church chants. Using the results, it is possible to draw conclusions about
the theory behind the composition of the melodies.
Lee and Slaney [48] used an HMM system as well, except the states were chords
and the HMMs were built for each key. While this system operated on signal input,
the supervised learning portion was performed with MIDI information. Interestingly,
the transition probabilities derived from the model show common chord relationships
from harmonic progressions. So, in this case, music theory was derived from a musical
corpus.
Stamatatos and Widmer [73] devised an approach for distinguishing musical per-
formers playing the same piece using timing, articulation, and dynamic information.
Promising results were achieved in the classification by using discriminant analysis
on the features.
2.6 Conclusion
In this chapter, we summarized some of the most significant work related to the music
experiments to follow in later chapters, specifically related to visualizations, key find-
ing, meter induction, and machine learning. Through these experiments, diffusion-
based music analysis will make novel contributions to each of these fields. However,
before progressing to these novel contributions, we will first need to thoroughly review
the mathematical theory behind diffusion mapping and the low-dimensional spaces it
creates for organizing the structure of high-dimensional data.
Chapter 3
Diffusion Maps
3.1 Introduction
Diffusion distance is a Euclidean metric in a non-linear space defined by diffusion
maps [20, 17, 18]. The process relates data points to each other based on local ge-
ometry. This approach is advantageous because local geometry avoids the pitfalls
of Euclidean distance in high-dimensional space, does not require any distribution
assumptions on the data set, and is highly robust to noise. Beyond these clear
advantages, the space created with diffusion maps also provides a low-dimensional
visualization of the structure of the data set, organized hierarchically from global to
local, and accurately stores the structure of the data in an efficient and light-weight
representation.
Diffusion has already been used in clustering [80], classification [74], data process-
ing for multiple applications [20], wavelet analysis [19], and non-linear independent
components analysis [71], among others. However, the applicability of diffusion to
a higher-level, structured rule-based system like music has not yet been examined.
Izmirli [38] used diffusion to classify tonal and atonal music with success, but ap-
proached the task as a classification problem and only touched on the implications of
the work for extracting and visualizing musical concepts and theory. He did observe,
however, that the low-dimensional visualization of the data set resembles the circle
of fifths, hinting at the possibility. In this work, the diffusion process will be applied
to symbolic musical data with the goal of extracting, interpreting, and visualizing
characteristics of Western music theory.
3.2 Diffusion Theory
3.2.1 Data sets
To introduce diffusion, first we must define a data set X , consisting of K data points
in N -dimensional space.
X = {x0, x1, ..., xK−1}, xi ∈ RN
Example data sets
Two examples of data sets will be used throughout this chapter, one consisting of
clusters and another of circles. Both are embedded in 2-D space (R2) for easier
plotting.
One data set, seen in Fig. 3.1(a), is drawn equally from 8 separate Gaussian
distributions. The means of the Gaussians were designed to create a hierarchical
structure. On the lowest level, each of the clusters is its own structure. Then, clusters
1 and 2 are very close, as are 7 and 8. Cluster 3 is also well connected to 1 and 2.
Then, up another level, clusters 1, 2, 3, 7, and 8 form a larger group on the left while
4, 5, and 6 create another group on the right. This data set will be used to show the
hierarchical nature of diffusion distance.
The second data set (Fig. 3.1(b)) is drawn from three circular distributions, in
which the radius varies slightly from a set value, but the angle is selected completely
randomly. This data set will show the advantages of using a connectivity-based dis-
tance, rather than Euclidean distance. Using these two completely different data sets
will also demonstrate the distribution-free properties of random walk- and diffusion-
based metrics.
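Data sets of this kind can be synthesized in a few lines; the means, spreads, and sample counts below are illustrative stand-ins, not the exact parameters behind Fig. 3.1.

```python
import numpy as np

rng = np.random.default_rng(0)

# Cluster set: points drawn equally from 8 Gaussians whose means are laid
# out so that some clusters sit close together (hierarchical structure).
means = np.array([[-12, 4], [-10, 2], [-8, 6], [8, 4],
                  [10, 0], [12, 4], [-12, -4], [-10, -6]], dtype=float)
clusters = np.vstack([m + rng.normal(scale=0.8, size=(50, 2)) for m in means])

# Circle set: three rings with a slightly noisy radius and a completely
# random angle.
rings = []
for radius in (1.0, 2.0, 3.0):
    theta = rng.uniform(0, 2 * np.pi, 200)
    r = radius + rng.normal(scale=0.05, size=200)
    rings.append(np.column_stack((r * np.cos(theta), r * np.sin(theta))))
circles = np.vstack(rings)

print(clusters.shape, circles.shape)
```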
(a) Clusters
(b) Circles
Figure 3.1: Two example data sets that will be used to demonstrate the process of calculating diffusion distance. In both cases, color indicates the distribution from which a sample was drawn.
3.2.2 Affinity function
In the data space, a function k(xm, xn) measures some affinity between two points in
the set X . This function must satisfy two criteria:
1. Symmetry → k(xm, xn) = k(xn, xm)
2. Non-negative → k(xm, xn) ≥ 0
The affinity k can be any function that fulfills these properties, but, in practice, it is
often selected to be the exponential distance function
k(xm, xn) = exp(−σ||xm − xn||2) (3.1)
where the parameter σ is selected based on the data. In the context of this work, we
will also often use cosine distance for the affinity function
k(xm, xn) = (xmT xn) / (||xm|| ||xn||) = ( Σp=0..N−1 xm(p) xn(p) ) / √( (Σp=0..N−1 xm(p)²) (Σp=0..N−1 xn(p)²) )   (3.2)
Note that, in order for the cosine distance function to satisfy non-negativity, all data
points xi must be non-negative as well.
The collection of affinities for the data set X can be used to create a Markov
process by defining p(xn|xm), the probability of moving to xn if starting at xm.
p(xn|xm) = k(xm, xn) / d(xm),   where   d(xm) = Σp=0..K−1 k(xm, xp)   (3.3)
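The chain of definitions above, from affinity to transition probability, translates directly into a few lines of NumPy; the Gaussian affinity of Eq. (3.1) is used here, and the σ value and sample points are arbitrary choices for illustration.

```python
import numpy as np

def markov_matrix(X, sigma=1.0):
    """Build the transition probabilities of Eq. (3.3) from the Gaussian
    affinity of Eq. (3.1): k(xm, xn) = exp(-sigma * ||xm - xn||^2)."""
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    A = np.exp(-sigma * sq_dists)      # affinity matrix: symmetric, >= 0
    d = A.sum(axis=1)                  # degree d(xm) = sum_p k(xm, xp)
    return A / d[:, None]              # row m holds p(. | xm)

# Two nearby points and one distant point.
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])
P = markov_matrix(X)
assert np.allclose(P.sum(axis=1), 1.0)  # each row is a distribution
```

Note the convention here indexes rows by the starting point; since the affinity matrix is symmetric, only the normalization differs between the row and column conventions.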
3.2.3 Affinity-derived Markov matrices
If we define a matrix Pmn = p(xn|xm), then we can step the Markov process forward
t steps by calculating P t. The entries of this matrix, pt(xn|xm), give the probability
of ending up in location xn after t steps if the starting location is xm.
All random walks have a stationary distribution π(xn), which defines the proba-
bility of ending up in each location as time approaches infinity. This is given by the
limit of the probability function pt for any starting point xp (in the limit, the starting
point is irrelevant).
π(xn) = lim t→∞ pt(xn|xp)   (3.4)
The eigenvector decomposition will give greater insight into this equation shortly.
If the graph is fully connected, meaning that it is possible to reach any point in the
data set from any other point in the data set, then the stationary distribution π is
unique. If it is not fully connected, then there will be multiple stationary distributions
πi, one for each separate component of the graph.
The random walk process prioritizes connectivity rather than simply distances,
and so data points that are highly connected are measured as close. This approach
is advantageous for several reasons. First, it is more robust to noise than Euclidean
distance. After all, perturbing the points in a data set will have a greater effect on
individual distances than on the connectivity. Second, it is distribution-free, meaning
that there are no assumptions made about the structures or shapes of the data.
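These claims can be checked numerically on a tiny hand-built example: stepping a four-point walk forward many steps makes the starting point irrelevant, and the resulting distribution is exactly the normalized degree function d from Eq. (3.3). The points and σ below are arbitrary.

```python
import numpy as np

# Two loose groups on a line: {0, 0.2} and {2.0, 2.2}.
X = np.array([[0.0], [0.2], [2.0], [2.2]])
A = np.exp(-(X - X.T) ** 2)               # Gaussian affinity, sigma = 1
P = A / A.sum(axis=1, keepdims=True)      # row-stochastic Markov matrix

Pt = np.linalg.matrix_power(P, 1000)
# After many steps every row (i.e., every starting point) carries the
# same distribution: the starting point no longer matters.
assert np.allclose(Pt[0], Pt[3], atol=1e-6)

pi = Pt[0]
assert np.allclose(pi @ P, pi)            # pi is stationary: pi P = pi
assert np.allclose(pi, A.sum(axis=1) / A.sum())  # pi is d, normalized
```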
“Hot Potato” analogy
A concrete way to visualize this process is to envision a game of “Hot Potato” in
which an object is passed between players. The object starts with one person who
tosses it to someone else, and that person then tosses to someone else, and so on.
Now let us say that, when someone is holding the hot potato, they are more likely to
toss it to someone close to them than far away. They also have the option of holding
on to the object, which, in this scenario, is the most likely outcome.
The probability distribution pt(xn|xm) then gives the probability of person xn
holding the hot potato after t tosses if it started at person xm. We can conceptually
draw a few simple conclusions based off this analogy. First, if t is small, xm and xn
Figure 3.2: The probabilities of a random walk concluding in each data point (with high probability shown in lighter color) at different time scales for the cluster-based data set. The columns, pt(·|x340), pt(·|x221), and pt(·|x610), show the case for three different starting points, shown as a red dot in each; t increases down the rows.
Figure 3.3: The probabilities of a random walk concluding in each data point (with high probability shown in lighter color) at different time scales for the circle-based data set. The columns, pt(·|x1), pt(·|x201), and pt(·|x501), show the case for three different starting points, shown as a red dot in each; t increases down the rows.
(a) P^100  (b) P^500  (c) P^5000  (d) P^100000
Figure 3.4: The probability matrix P^t for the cluster data set (Fig. 3.1(a)) at several values of t. Axis labels correspond to the cluster numbering from Fig. 3.1(a).
will almost certainly be close to each other. Second, if t is large, it makes sense that
someone in a crowded area is more likely to end up with the hot potato than someone
in a secluded area. Another reasonable conclusion is that, if the hot potato is in one
crowd, it is unlikely to move into another crowd in a small number of steps if the two
crowds are not well connected. Finally, and most importantly, the hot potato is more
likely to move between two points with many pathways of small steps between them
rather than two points the same distance apart with few pathways connecting them.
In other words, connectivity is more important than proximity.
All of these conclusions apply to data viewed as a random walk, and they can
easily be seen for the data in the next section. We will also revisit this analogy to
interpret the diffusion time constant.
Markov matrices for example data sets
The power of this probabilistic approach is visualized in Figs. 3.2 and 3.3 for the
cluster and circle data sets, respectively. These figures show the probabilities of the
random walk from three randomly selected starting points. In the cluster case (Fig.
3.2), the plots demonstrate how, as the time scale increases, the random walk spreads
outward from its point of origin. In the beginning, all three cases stick to their local
region. In the middle case, the probabilities divide the clusters into their left and
right subgroup. As t continues to increase indefinitely, all starting points yield the
same stationary distribution.
The circle case (Fig. 3.3) shows a similar progression. In the short-term case,
the activity stays in the neighborhood of the starting point. For this data set, the
middle-range time distribution is the most interesting. In this case, all three starting
points have a very high probability of spreading somewhere throughout their indi-
vidual circle. This indicates that, on some level, we have successfully segmented the
three circles. Diffusion distance will further clarify this separation. If t continues
to increase, the distributions of all three starting points once again approach the
stationary distribution π.
The hierarchical nature of the parameter t can also be shown at this point for the
cluster data set. Fig. 3.4 shows the full probability matrix P t for several values of
t. For t = 100, the smallest value plotted, seen in Fig. 3.4(a), the clusters are all
clearly seen in the matrix as a set of blocks along the diagonal. Notice that clusters 1 and
2 are also starting to group together, as are 7 and 8 to a lesser extent. For t = 500
in Fig. 3.4(b), the next level of structure is extracted, grouping clusters 1, 2, and 3
into one set, 4, 5, and 6 into another, and 7 and 8 into a third. By increasing t to
5000 as in Fig. 3.4(c), the two higher level sets are grouped together. Finally, by
increasing t much higher (t = 100000 in Fig. 3.4(d)), all of the data points have the
same distribution, the stationary distribution.
In these examples, it is made clear that treating the data set as a random walk
is a valuable and powerful tool. Sets from two completely different distributions are
properly organized and even the hierarchical sub-structures can be extracted with
ease.
Prior uses of a Markov matrix for data analysis
It has been previously shown that calculating the Markov matrix P for the data set
X can be a very powerful tool.
Tishby and Slonim [80] first introduced the concept of analyzing data sets as
random walks. Their approach looked for stable regions in the decay rate of the
mutual information of the probability functions p as the time t increased, and used
those stable regions to define clusters in the data set. Szummer and Jaakkola [74]
used the Markov matrix of the data to assign labels to a data set with only a small
number of labels known.
However, in both of these cases, high-level, computation-intensive algorithms like
EM are required to process the data. Though the steps and logic that take the data
to the Markov matrix are relatively clear and straightforward, there is no similarly
clear and straightforward means of analyzing the data itself. It is for this reason that
we use the Markov matrix P to move the data into diffusion space.
3.2.4 Eigenvectors of a Markov matrix
The first step towards a diffusion map from the Markov matrix requires the eigenvec-
tor decomposition. The eigenvectors of a matrix are vectors that, when multiplied by
the matrix, are only scaled and not rotated at all.
Pν = λν
φP = λφ
Here we see that there can be both right eigenvectors (ν) and left eigenvectors (φ),
and, in both cases, the vectors are scaled by λ, called the eigenvalue.
It is possible to do a full eigenvector decomposition, in which a matrix is defined
by its eigenvectors and eigenvalues.
P = ΛEΛ⁻¹ = Σl=0..K−1 λl φl νlT
P^t = ΛE^tΛ⁻¹ = Σl=0..K−1 λl^t φl νlT   (3.5)
Here, φl are the full set of left eigenvectors and νl are the right eigenvectors, and
λl are the corresponding eigenvalues. It is a property of Markov matrices that the first
eigenvalue will be equal to one, and all eigenvalues will be less than or equal to one in magnitude.
|λ0| = 1, |λl| ≤ 1 (3.6)
For a connected graph, the inequality is strict.
It is also a property that the first left eigenvector φ0 is constant for all data points
and the first right eigenvector ν0 is proportional to the stationary distribution of the
random walk π (when multiplied by φ0, the proportionality becomes an equality).
The stationary distribution is also the function d from Eq. (3.3), normalized to sum to one.
ν0(xm) ∝ π(xm) = φ0(xm) ν0(xm) = d(xm) / Σp=0..K−1 d(xp)   (3.7)
Another critical property of the eigenvalue decomposition is shown in Eq. (3.5),
where stepping the random walk forward t steps requires only raising the eigenvalues
to the t power. As a result, calculating the probabilities pt(xn|xm) can be done
efficiently with the eigendecomposition. Instead of self-multiplying a K × K matrix t
times, the calculation only requires K scalar exponential calculations. This property
is also essential for the diffusion distance, to be introduced shortly.
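This efficiency property can be sketched in NumPy. One standard implementation device (not discussed in the text) exploits the fact that P = D⁻¹A is similar to the symmetric matrix D^(−1/2) A D^(−1/2), so a symmetric eigensolver yields the left and right eigenvectors stably, and P^t is recovered by powering only the eigenvalues, as in Eq. (3.5).

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 2))
A = np.exp(-0.5 * ((X[:, None] - X[None, :]) ** 2).sum(-1))  # affinity
d = A.sum(axis=1)
P = A / d[:, None]                              # Markov matrix

# Symmetrization trick: P is similar to D^-1/2 A D^-1/2.
lam, U = np.linalg.eigh(A / np.sqrt(np.outer(d, d)))
lam, U = lam[::-1], U[:, ::-1]                  # sort descending

assert np.isclose(lam[0], 1.0)                  # first eigenvalue is one
assert np.all(np.abs(lam) <= 1 + 1e-10)         # the rest are no larger

right = U / np.sqrt(d)[:, None]                 # right eigenvectors nu_l
left = U * np.sqrt(d)[:, None]                  # left eigenvectors phi_l

# Eq. (3.5): t steps cost only K scalar exponentiations, not t
# matrix multiplications.
t = 50
Pt_eig = (right * lam**t) @ left.T
Pt_pow = np.linalg.matrix_power(P, t)
assert np.allclose(Pt_eig, Pt_pow, atol=1e-8)
```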
It is also important to note that eigenvectors can be efficiently calculated for data
points not included in the original data set using the Nystrom method [3, 89].
φl(y) = (1 / (K λl)) Σn=0..K−1 k(y, xn) φl(xn)   for y ∉ X
This property is extremely valuable for large data sets, because it allows for the
eigenvectors to be calculated on a representative subsampling and then extended, rather
than requiring a full eigenvector decomposition on the entire set. This also makes it
possible for eigenvector methods to be used for query-based applications.
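The extension idea can be illustrated with the affinity matrix's own eigenvectors; note that this sketch uses a plain 1/λl normalization for the unnormalized kernel case, a simplification of the form given above. Extending a point that is already in the set should reproduce its known eigenvector entry, which makes for an easy sanity check.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(40, 2))

def affinity(a, b, sigma=0.5):
    return np.exp(-sigma * ((a - b) ** 2).sum(-1))

A = affinity(X[:, None], X[None, :])       # 40 x 40 affinity matrix
lam, V = np.linalg.eigh(A)                 # symmetric: real eigenpairs
lam, V = lam[::-1], V[:, ::-1]             # sort descending

def extend(y, l):
    """Nystrom-style extension: evaluate eigenvector l at a new point y
    by weighting the known entries with y's affinities to the set."""
    return affinity(y[None, :], X) @ V[:, l] / lam[l]

# Sanity check: extending a point that IS in X reproduces its entry.
assert np.isclose(extend(X[7], 0), V[7, 0])
```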
3.2.5 Eigenvalue decay and the meaning of the eigenvectors
The magnitude restrictions in Eq. (3.6) imply that, as t increases, the powered eigenvalues λ_l^t will decrease. Therefore, for larger t, eigenvectors corresponding to sufficiently small eigenvalues can be ignored, assuming there is some tolerance for error, meaning that the calculation of P^t can be made even more efficient. Naturally, smaller eigenvalues approach zero at smaller values of t than larger eigenvalues do.
The decay of the eigenvalues also gives some insight into the meaning of the
eigenvectors themselves. As noted above, the probability of moving from one data
point to another in t steps can be written as a function of the eigenvectors.
p_t(x_n|x_m) = Σ_{l=0}^{K−1} λ_l^t φ_l(x_m) ν_l(x_n)    (3.8)
In other words, the probability p_t is a weighted sum of products of the eigenvectors. However, for different times t, only the weighting changes in the sum, while the products of the eigenvectors remain the same.
This realization, along with the previously noted property of smaller eigenvalues
decaying faster than larger eigenvalues, gives valuable insight into the meaning of the
eigenvectors. All of the non-trivial eigenvectors are required to calculate the proba-
bility for t = 1, and then, as t increases, they will disappear in reverse order of the
eigenvalue magnitude. This means that the eigenvectors with large eigenvalues en-
code the information for the long-term probability function, and the eigenvectors with
smaller eigenvalues encode the information for the short-term probability function.
To approach this in a different way, let us start with the limit presented in Eq.
(3.4) that states that the probability function for any starting state approaches the
stationary distribution as t approaches infinity. Another way to derive this is to observe that, as t approaches infinity, all eigenvalues with magnitude strictly less than 1, when raised to the t power, approach zero. For a connected graph, all eigenvalues except λ_0 satisfy this, and so all terms but l = 0 disappear.
lim_{t→∞} p_t(x_n|x_m) = Σ_{l=0}^{K−1} λ_l^∞ φ_l(x_m) ν_l(x_n) = φ_0(x_m) ν_0(x_n)
As a brief aside, note that this derivation reinforces Eq. (3.7), and it also offers
further insight into the limit in Eq. (3.4). First of all, it gives a very simple and
clean interpretation of a previously complicated limit, because all of the eigenvectors
but one simply disappear from the summation. Also, the irrelevance of the starting
location is confirmed, because, as previously stated, the left eigenvector φ0 is constant
for all data points, and therefore the limit is unaffected by changing the initial data
point.
Returning to the relevance of the eigenvalues to the meaning of the eigenvectors, let us slowly decrease t. Because λ_0 = 1, the weight of the stationary distribution will
remain unchanged, regardless of t. However, as t decreases, the eigenvectors with
larger eigenvalues start to appear and significantly affect the summation, followed by
those with smaller eigenvalues.
In this way, the eigenvectors encode the evolution of the probability function over time, and the eigenvalues encode the time scale at which the corresponding eigenvectors are most relevant. This interpretation is extremely important in understanding the value of
diffusion distance.
3.2.6 Diffusion maps and diffusion distance
Based on the properties of the eigenvectors of P , a diffusion map Ψt is proposed, in
which the coordinates are the scaled eigenvectors.
Ψ_t(x_n) = [λ_1^t φ_1(x_n), λ_2^t φ_2(x_n), ..., λ_L^t φ_L(x_n)]    (3.9)
Note that the first eigenvector (l = 0) is excluded from the mapping, because the first
left eigenvector φ0 is constant for all data points, and therefore would be a trivial
inclusion. For perfect representation, L, the number of dimensions of the diffusion
map, is equal to one less than the number of data points (K − 1). Alternatively, if t
is large enough to zero out smaller eigenvalues, or some error is acceptable, L can be
set smaller to allow for more efficient calculation and storage.
The diffusion map also provides an opportunity for visualization of the data set.
While the dimensionality of the map itself will often be more than 3, the trait observed
above that the eigenvectors are organized and categorized based on their relevant time
scale means that subsets of the dimensions can be meaningfully visualized together,
unlike the original data space, where there is no known significance to any particular
set of dimensions.
Observe that the Euclidean norm between two data points in this diffusion space is determined by the differences between the values of the eigenvectors at those data points. This metric is called the diffusion distance, D_t.
D_t(x_m, x_n) = ||Ψ_t(x_m) − Ψ_t(x_n)|| = sqrt( Σ_{l=1}^{L} λ_l^{2t} (φ_l(x_m) − φ_l(x_n))^2 )    (3.10)
The weighted eigenvectors in the diffusion distance are very similar to those in
Eq. (3.8), in which the probability of moving from one point to another in t steps is
calculated from the eigenvectors. Extending from this observation, we see the diffusion
distance can also be interpreted as a weighted norm of the differences between the
probability distributions of the two data points.
D_t(x_m, x_n) = sqrt( Σ_{p=0}^{K−1} (p_t(x_p|x_m) − p_t(x_p|x_n))^2 / d(x_p) )
The weighting by d(xp) is necessary to compensate for the absence of the right
vectors νl from Eq. (3.8). A high-level description of the diffusion distance in this
interpretation is that it measures how different xm and xn are as starting points. A
small diffusion distance between two data points means that random walks starting
at each point have a similar set of probabilities for the finishing point. The prob-
ability matrices in Fig. 3.4 reinforce the notion that this is a meaningful metric.
As can clearly be seen, data points within the same structural element have similar
distributions. Diffusion distance will measure these data points as close.
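The equivalence of the two forms of the diffusion distance can be checked numerically. The sketch below uses illustrative data, and computes the eigenvectors through the symmetric conjugate of P, a normalization choice under which Eq. (3.10) holds exactly; it builds the diffusion map of Eq. (3.9) and compares the two distances:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(8, 2))                       # illustrative data
Kmat = np.exp(-((X[:, None] - X[None, :]) ** 2).sum(-1))
d = Kmat.sum(axis=1)
P = Kmat / d[:, None]                             # random-walk matrix

# Eigenvectors via the symmetric conjugate A = D^{-1/2} K D^{-1/2}; this
# normalization of the φ_l makes Eq. (3.10) exact.
A = Kmat / np.sqrt(np.outer(d, d))
lam, U = np.linalg.eigh(A)
order = np.argsort(-lam)
lam, U = lam[order], U[:, order]
phi = U / np.sqrt(d)[:, None]                     # φ_0 is constant, as noted above

t = 10
Psi = lam[1:] ** t * phi[:, 1:]                   # diffusion map, Eq. (3.9)

m, n = 0, 5
D_map = np.linalg.norm(Psi[m] - Psi[n])           # Euclidean norm in map space
Pt = np.linalg.matrix_power(P, t)
D_prob = np.sqrt((((Pt[m] - Pt[n]) ** 2) / d).sum())   # d-weighted probability form
assert np.isclose(D_map, D_prob)
```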
Diffusion distances for the example data sets
Diffusion distances Dt for a few values of t are shown in Fig. 3.5 for the cluster data
set and Fig. 3.6 for the circle data set.
In the cluster case, the conclusions are similar to those from Fig. 3.4, where the
probability matrices for the cluster data set are shown. However, instead of having to
infer the closeness of the data points by looking for similar distributions, the diffusion
distance actually quantifies the closeness directly. So, for t = 100, shown in Fig. 3.5(a), it is clear that the clusters all have small diffusion distances within their
Figure 3.5: Diffusion distance matrices D_t for different values of t for the cluster data set, where dark color means a short distance. Panels: (a) D_100, (b) D_500, (c) D_5000.
sets. The same is true for the other hierarchical structures as t is increased in Figs.
3.5(b) and 3.5(c).
In the circle case, the ability of the metric to group the circular distributions is
outlined clearly. If the time parameter is set large enough for the connectivity to
spread sufficiently, then the three circles are completely grouped together and clearly
separated in the distances, as seen for t = 5000 in Fig. 3.6(b).
The separation of the circles can also be clearly seen by visualizing the first three (non-trivial) eigenvectors, φ_1, φ_2, and φ_3, seen in Fig. 3.7. The points in the three
circles have clearly been separated from each other in this 3-D space. Some of the
Figure 3.6: Diffusion distance matrices D_t for different values of t for the circle data set, where dark color means a short distance. Panels: (a) D_100, (b) D_5000.
characteristics in the original circles are preserved as well. The relative size is clearly
visualized, with cluster 3’s large radius still well represented, cluster 2 the second
largest, and cluster 1 in a tight ball. Also, the circular shape is preserved, most easily
seen for cluster 3.
3.2.7 Diffusion time constant
The diffusion distance exists not only in the spatial dimensions defined by the dif-
fusion maps, but also in the dimension of time, as represented by the parameter t.
While this parameter is key to some of the most unique features of diffusion, most sig-
nificantly the hierarchical organization of structure, there are some situations where
it is undesirable.
One such example is clustering in diffusion space. While the diffusion distance D_t could be used as the metric for clustering, this would require a unique parameter selection, and even if a reasonable t were selected, fixing the parameter
eliminates its hierarchical properties.
For cases where the parametric diffusion distance is suboptimal, a novel metric, the diffusion time constant τ, is proposed. The metric is defined as the minimum value of the time parameter t for which the diffusion distance is less than
Figure 3.7: The circle data set plotted in φ_1, φ_2, and φ_3.
or equal to some sufficiently small tolerance δ. The definition can be posed as a simple
optimization problem.
minimize t
subject to D_t(x_m, x_n) ≤ δ
Because the diffusion distance for any two data points decreases in the time di-
mension (a property from Eq. (3.10)), it is easy to see that this problem is solved
when the diffusion distance is equal to the tolerance level δ. So, the diffusion time
constant τ is the value of t that satisfies this relationship.
D_{τ(x_m, x_n)}(x_m, x_n) = δ    (3.11)
Essentially, the diffusion time constant represents the amount of time it takes for
two data points to become indistinguishable starting points in the random walk.
This is more easily visualized in the context of the “Hot Potato” analogy from
Section 3.2.3, in which the probability distribution pt(xq|xp) is seen as the likelihood
of a hot potato getting passed from person xp to xq in t steps. In this context, the
diffusion time constant τ(xm, xn) is the number of tosses it would take before it would
be impossible to determine whether the hot potato started with person xm or xn.
In Section 3.2.3, it was suggested in the context of the analogy that the hot potato
would likely be in the local vicinity for a small number of tosses, and would only move
between poorly connected crowds after a large number of tosses. This also makes a
meaningful statement about the diffusion time constant. Data points within a cluster
will then become indistinguishable starting points before objects across clusters. In
this way, at a high level, the ability of the metric to simultaneously encode hierarchical
data can be understood.
Solving for τ is not entirely trivial. Eq. (3.10) shows that the diffusion distance D_t is the sum of a series of exponential functions in t, so solving for a specific value of t requires solving this sum of exponentials. Unfortunately, sums of exponentials cannot, in general, be solved in closed form.
Newton’s Method
One common method for finding solutions to sums of exponentials is using Newton’s
method (also called the Newton-Raphson method), which can be used to iteratively
find the zeros of non-linear functions [90].
t_{n+1} = t_n − f(t_n) / f'(t_n)

In this notation, f'(t_n) represents the first derivative of the function f evaluated at t_n.
By initializing with some arbitrary value t0, this process solves for the zeros of
f(t), often in only a few steps.
Newton’s method can be used to find the diffusion time constant by defining f(t)
from Eq. (3.11) (after squaring both sides of the equation).
f(t) = D_t(x_m, x_n)^2 − δ^2
By incorporating Eq. (3.10), all unknowns except for t are eliminated.
f(t) = Σ_{l≥1} λ_l^{2t} (φ_l(x_m) − φ_l(x_n))^2 − δ^2
The zero-crossing for this function f(t) is the diffusion time constant τ , and therefore
applying Newton’s method will solve for τ .
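A sketch of this procedure, assuming the eigenvalues and eigenvectors are already available. Writing λ_l^{2t} as (λ_l^2)^t keeps the base positive, which gives a well-defined derivative for the Newton update; the demo data and tolerance are illustrative choices:

```python
import numpy as np

def diffusion_time_constant(lam, phi, m, n, delta, t0=1.0, iters=50):
    """Newton's method for tau: find the root of
    f(t) = sum_{l>=1} lam_l^{2t} (phi_l(x_m) - phi_l(x_n))^2 - delta^2."""
    c = (phi[m, 1:] - phi[n, 1:]) ** 2      # squared eigenvector differences
    b = lam[1:] ** 2                         # positive bases: (lam_l^2)^t = lam_l^{2t}
    c, b = c[b > 0], b[b > 0]                # drop null eigenvalues before the log
    t = t0
    for _ in range(iters):
        bt = b ** t
        f = (c * bt).sum() - delta ** 2
        fp = (c * bt * np.log(b)).sum()      # f'(t); log(b) < 0 for |lam_l| < 1
        t -= f / fp
    return t

# Demo on an illustrative kernel: the returned tau satisfies D_tau = delta (Eq. 3.11).
rng = np.random.default_rng(2)
X = rng.normal(size=(8, 2))
Kmat = np.exp(-((X[:, None] - X[None, :]) ** 2).sum(-1))
d = Kmat.sum(axis=1)
lam, U = np.linalg.eigh(Kmat / np.sqrt(np.outer(d, d)))
order = np.argsort(-lam)
lam, U = lam[order], U[:, order]
phi = U / np.sqrt(d)[:, None]

m, n = 0, 5
c = (phi[m, 1:] - phi[n, 1:]) ** 2
delta = 0.5 * np.sqrt(((lam[1:] ** 2) * c).sum())   # half of D_1(x_m, x_n)
tau = diffusion_time_constant(lam, phi, m, n, delta)
D_tau = np.sqrt((((lam[1:] ** 2) ** tau) * c).sum())
assert np.isclose(D_tau, delta)
```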
Diffusion time constants for the example data sets
Fig. 3.8 shows the diffusion time constant matrices for both data sets.
For the cluster data set, the diffusion time constants seen in Fig. 3.8(a) clearly
Figure 3.8: The diffusion time constant matrices for the example data sets, demonstrating that data points within structural elements have small time constants. Panels: (a) Clusters, (b) Circles.
group the lowest level clusters together, as seen by the strong block diagonal structure.
The higher level groupings can be seen to a certain extent, especially in the closeness
of clusters 1 and 2 as well as 7 and 8, but how the higher levels organize is not
entirely evident. This is because the diffusion time constant is not meant for visual
interpretation, but rather as a metric for tasks such as clustering, which will be shown
shortly.
The diffusion time constants for the circle data set, seen in Fig. 3.8(b), clearly group the circular elements together, though not as drastically as the diffusion distance for t = 5000 shown in Fig. 3.6(b). While there is, not surprisingly, some
variation within the groups, the time constants between data points within a circle
are less than the time constant with any other data point. This figure is in contrast
to the Euclidean distance (Fig. 3.9), which does not have only small distances within
the circles. This can most obviously be seen in the outer circle (circle 3), where the
distances between data points on opposite sides are larger than distances to any other
data point, even those in the inner circles.
Figure 3.9: The Euclidean distances for the circle data set, where the data points within the same circle are not close together.
3.2.8 Hierarchical clustering with the diffusion time constant
The diffusion time constant metric is intended for uses where selecting the time pa-
rameter t for the diffusion distance or simply limiting the diffusion distance to one
time scale is undesirable. Hierarchical clustering is one of these uses.
Hierarchical clustering organizes a data set into a tree where each branch represents a cluster. The data is grouped into low-level clusters, those clusters are themselves subsequently clustered, and so on. This process is represented in the tree structure.
The specific algorithm that will be used in this work is the agglomerative complete-
link clustering algorithm [39]. This algorithm measures the distance between two
clusters as the maximum of the distances between all points within the clusters.
Complete-link clustering typically creates a cleaner structure than single-link cluster-
ing, which uses the minimum of the distances instead. The complete-link algorithm
is as follows:
1. Consider each data point as an object, and measure the distances between all
objects.
2. Find the two objects with the smallest distance between them, and group those
two objects together as a new object.
3. Measure the distances between this new object and all other objects. The
distance is the maximum distance between all data points in the objects.
4. Repeat steps 2 and 3 until only one object remains.
This process is easy to implement, and it can be used in diffusion space by using the
diffusion time constant as the distance metric.
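A minimal sketch of these four steps (written for clarity rather than efficiency; the input is any precomputed distance matrix, such as a matrix of diffusion time constants):

```python
import numpy as np

def complete_link(D):
    """Agglomerative complete-link clustering on a distance matrix D,
    following the four steps in the text. Returns the merge history as
    (cluster_a, cluster_b, distance) tuples."""
    n = D.shape[0]
    clusters = {i: [i] for i in range(n)}   # step 1: each point is an object
    merges = []
    next_id = n
    while len(clusters) > 1:
        ids = list(clusters)
        best = None
        for i, a in enumerate(ids):          # step 2: find the closest pair
            for b in ids[i + 1:]:
                # step 3: complete link = max pairwise distance between members
                dist = max(D[p, q] for p in clusters[a] for q in clusters[b])
                if best is None or dist < best[0]:
                    best = (dist, a, b)
        dist, a, b = best
        merges.append((a, b, dist))          # group the pair as a new object
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        next_id += 1
    return merges                            # step 4: repeat until one object

# Demo on four points on a line: the two tight pairs merge first.
x = np.array([0.0, 1.0, 10.0, 11.0])
D = np.abs(x[:, None] - x[None, :])
merges = complete_link(D)
```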
In addition to the obvious organizational advantages, creating a hierarchical clus-
ter can directly lead to applications like active learning [22].
Clustering the example data sets
The hierarchical clusters derived from the complete-link algorithm for the example
data sets can be seen in Fig. 3.10.
The cluster-based data set (seen in Fig. 3.10(a)) is particularly interesting. Not only are the low-level clusters well grouped, with only a few errors, but the clusters are also properly organized in the hierarchy. Clusters 1 and 2 are grouped together
first, as are 7 and 8. Cluster 3 joins with 1 and 2, and then they are all grouped
together (1, 2, 3, 7, and 8) to form one large structural element. On the other side,
clusters 4 and 6 are grouped together first, and then joined by cluster 5 to form the
other global element. These hierarchies match up exactly with the intuition outlined
previously in section 3.2.1.
The hierarchy for the circle data set (Fig. 3.10(b)) is much simpler, since the hi-
erarchical structure is not as layered, but the three circles are still perfectly clustered,
which is the goal for this data set.
3.2.9 Comparison to other methods
Diffusion is by no means the only approach for identifying the essential structure in a
high-dimensional data set. Here, we will compare diffusion mapping to principal com-
ponents analysis, multidimensional scaling, ISOMAP, and locally linear embedding,
which all attempt to find a low-dimensional space that still captures some element of
Figure 3.10: Hierarchical trees for both data sets from the diffusion time constants, showing that structural elements at multiple levels are accurately extracted. Panels: (a) Clusters, (b) Circles.
the original data’s structure. All of these methods, like diffusion mapping, are also
unsupervised.
Each algorithm takes a different approach to try to solve the same problem, which
is to find a low-dimensional representation that captures the important elements of
the structure of the high-dimensional data set. Because they are different, though,
each has its own share of advantages and disadvantages.
The argument here is that diffusion mapping still offers the most natural definition
of connectivity, that the hierarchical ordering of the dimensions is the most clearly un-
derstandable, and that, unlike the other methods, there is not a fundamental concept
of approximation in its derivation. All of these are seen as advantages.
Principal components analysis
Principal components analysis (PCA) is one of the classic methods of dimensional
reduction for data sets. The method essentially identifies the directions in the data’s
high dimensional space in which the data has the highest covariance. In these direc-
tions, the differences between the data points should be most prominent, and, as a
result, this approach is widely used in applications like machine learning.
However, it is also fundamentally different from diffusion mapping. First of all, the structure is not built upon connectivity, but rather on second-order statistics. As a
result, the principal components are defined purely upon the covariance of the data,
with no concern for the actual structure. Essentially, PCA is built on the assumption
that the data is a large cloud, and it finds the directions over which the cloud is most
spread out. While this is useful for classification tasks, there is no real insight gained
into the actual structure of the data. There is no guarantee that the dimensions with
the highest covariance will also be the dimensions that best represent the structure.
Also, because PCA is a linear projection of the data, structures that require
more than three dimensions to represent (for example, a ribbon that winds through 5
dimensions of space) will not be reshaped to fit cleanly into a smaller set of dimensions.
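A minimal PCA sketch via the singular value decomposition (the standard computation, shown here for reference; the data is illustrative):

```python
import numpy as np

def pca(X, dim=2):
    """Project data onto the directions of highest covariance (via SVD)."""
    Xc = X - X.mean(axis=0)                      # center the data
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:dim].T                       # coordinates on the top components

# The first component captures the most-spread direction of the data "cloud".
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3)) * np.array([5.0, 1.0, 0.2])
Y = pca(X, dim=2)
assert Y.var(axis=0)[0] >= Y.var(axis=0)[1]
```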
Classical multidimensional scaling
Classical multidimensional scaling (MDS) is a non-linear approach, so, in theory, it is able to reorient high-dimensional structures for low-dimensional visualization. The algorithm solves a least-squares optimization problem that finds a low-dimensional data set {..., y_m, y_n, ...} ∈ Y whose pairwise distances are similar to those of the original data set {..., x_m, x_n, ...} ∈ X.
argmin_Y Σ_{m,n} || ||y_m − y_n|| − ||x_m − x_n|| ||
MDS does give nice low-dimensional representations of the original data set. How-
ever, it has a few shortcomings in comparison to a method like diffusion mapping.
First of all, it is fundamentally a statistical approximation of the data, in that the dimensions are defined by the minimal squared error. This is not as natural an organization of the dimensions as in diffusion mapping, where the dimensions are organized by the time parameter of a random walk; instead, understanding the organization fundamentally requires an understanding of the error.
Furthermore, squared error emphasizes avoiding large errors. While this is usually
an advantageous characteristic, in this case it will place a higher priority on accu-
racy for large distances, since a small percentage error will be more costly for two
data points that are far apart than two points that are close together. This is ex-
actly the opposite of what we usually desire for dimensional reduction, which is the
preservation of local geometry. Also, MDS is based only on the proximity between data points, rather than incorporating any kind of connectivity; diffusion mapping is fundamentally built on connectivity and local geometry.
Finally, MDS typically defines the relationships between two points based on their
distance, and this is largely inflexible in order to keep the optimization problem
reasonably calculable. In the diffusion process, the affinity between two points is
much more flexible, requiring only symmetry and non-negativity.
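For reference, the classical variant of MDS also admits a closed-form solution via double-centering of the squared-distance matrix, rather than the iterative least-squares optimization described above; a minimal sketch:

```python
import numpy as np

def classical_mds(D, dim=2):
    """Closed-form classical MDS: double-center the squared distances and
    embed with the top eigenvectors of the resulting Gram matrix."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                  # inner-product (Gram) matrix
    lam, V = np.linalg.eigh(B)
    order = np.argsort(-lam)[:dim]               # top dim eigenpairs
    return V[:, order] * np.sqrt(np.clip(lam[order], 0.0, None))
```

For a distance matrix that is exactly Euclidean in `dim` dimensions, this reconstruction recovers the original pairwise distances up to rotation.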
Locally linear embedding
Locally linear embedding (LLE) is another non-linear dimensional reduction algorithm, and it is based on approximations similar to those in MDS. However, it prioritizes structure more highly by viewing the data through an approximation of its local orientations.
The method begins by approximating each data point as a linear combination of
its nearest neighbors. This can be accomplished by solving for the weights Wmn with
a least-squares optimization.
argmin_{W_{m,n}} Σ_n ||x_m − W_{m,n} x_n||
Through the weights Wmn, the entire data set is represented as a collection of linear
functions defined by local geometry. Dimensional reduction can then be accomplished
by finding the best low-dimensional set of data points {..., ym, yn, ...} ∈ Y that share
these local linear relationships.
argmin_Y Σ_{m,n} ||y_m − W_{m,n} y_n||
LLE can be seen as a variation of MDS and, as a result, shares many of its advantages. Additionally, because it fundamentally incorporates local geometry, it fixes that shortcoming of MDS.
However, it is still built on a statistical approximation, and, in fact, incorporates two such optimizations. This makes it even more difficult to conceptualize the effects of the error. Furthermore, the linear approximation at a local level, while usually acceptable,
forces an assumption onto the structure of the data, which is that it can be reasonably
modeled at a local level as a linear function. This is not always the case, and no such
assumption is required for diffusion mapping.
Isomap
Isomap is another extension of MDS that incorporates local geometry. However, in this case, the distance is based on connectivity rather than on a locally linear approximation, as was used in LLE.
The Isomap algorithm begins by determining some limited set of neighbors for
each data point in X . The distance between two points xm and xn, d(xm, xn), is
then defined as the shortest pathway between them that is drawn exclusively through
the neighbors. So, if the two points are neighbors, then d(xm, xn) is the distance
between them. If the two points are not neighbors but they share a neighbor xp, then
d(xm, xn) is defined as the distance from xm to xp plus the distance from xp to xn.
This is extended to however many pathways are necessary to connect the two data
points.
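The neighbor-graph distance described above can be sketched with a k-nearest-neighbor graph and Floyd-Warshall shortest paths (the neighbor count k and the data are illustrative choices):

```python
import numpy as np

def isomap_distances(X, k=3):
    """Geodesic distances for Isomap: build a k-nearest-neighbor graph,
    then take shortest paths through it (Floyd-Warshall)."""
    n = X.shape[0]
    E = np.sqrt(((X[:, None] - X[None, :]) ** 2).sum(-1))  # Euclidean distances
    G = np.full((n, n), np.inf)
    np.fill_diagonal(G, 0.0)
    for i in range(n):
        nbrs = np.argsort(E[i])[1:k + 1]        # k nearest neighbors (self excluded)
        G[i, nbrs] = E[i, nbrs]
        G[nbrs, i] = E[i, nbrs]                  # keep the graph symmetric
    for p in range(n):                           # relax paths through each node p
        G = np.minimum(G, G[:, [p]] + G[[p], :])
    return G
```

On points sampled along an arc, the resulting graph distance between the endpoints follows the arc and therefore exceeds the straight-line Euclidean distance, which is exactly the behavior Isomap relies on.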
Once these distances are established, then the MDS algorithm is applied with
these distances used as the distance between data points in the original set X .
argmin_Y Σ_{m,n} || ||y_m − y_n|| − d(x_m, x_n) ||
The Isomap algorithm is a very elegant way to incorporate connectivity into the
MDS algorithm. However, there are still a few potential drawbacks that remain. The
first is that it is still fundamentally built on a minimization of statistical error, which
is only a drawback in its ability to be understood intuitively.
Another minor difference is that the Isomap process requires a strict definition of
the set of neighbors for a data point, which is necessary because the lack of connectiv-
ity between some data points is what differentiates Isomap from MDS. In the diffusion
case, this sort of strict distinction is not necessary; a smoothly decaying affinity function is typically used instead. It is not difficult to envision a situation where a small change in the strict neighbor definition required by Isomap would significantly change the distance function d(x_m, x_n) for points along the border of a neighborhood. This problem does not need to be confronted in diffusion mapping.
The final drawback is more subtle, but is grounded in the definition of distance as
the minimal pathway between two data points. This means that there is no concept of how well-connected two points are, but only of whether they are connected. In the case of diffusion, structure is defined not only by the shortest path between two data points, but also by how many short paths connect them. In the case
of Isomap, bringing two data points close together only requires one pathway. This
difference may not matter in most cases, or may even be advantageous to some, but
it is another fundamental difference between Isomaps and diffusion maps.
3.3 Applying diffusion distance to music analysis
Diffusion distance has several features that suggest it fits well with musical organization and analysis: the connectivity-based, distribution-free metric holds promise for both theoretical analysis and the organization of musical collections.
For theoretical analysis, one of the most important characteristics of diffusion
distance is that it is distribution-free, meaning no assumptions are made about the
structure or orientation of the data. This is essential for music analysis, because the
structure and organization of music varies greatly. While traditional Western music
theory does outline a series of relationships used in Western music, they can be used
in numerous ways, as demonstrated by the wide range of music that has been created
while following that theory. Entering into an analysis process with as few assumptions
as possible about the music itself allows for a clean slate and avoids the bias of trying
to fit a music piece into an inappropriate system. It also allows for an individual
work to define its own structure, rather than forcing it to conform to an exterior one.
This trait is useful for any data analysis, but is especially appropriate in the musical
context.
If a musical work is allowed to create its own space, as it is in diffusion, this also
means that the space itself becomes an interesting tool for analysis. Different musical
works drawing from different theoretical backgrounds will create different spaces. The
flexibility of diffusion also allows for two works to be combined in a separate analysis,
giving multiple avenues for analysis and comparison. A long term goal in diffusion
research should be to fully harness this power in the musical domain.
The distribution-free aspect is also important for theoretical analysis because the
theory and rules are well known in a musical space, but translating them into an
arbitrary geometrical space is not trivial. Diffusion distance allows for the use of
different feature sets that project the musical data into any variety of spaces without
requiring any knowledge about the corresponding structural distortions caused by
the projection. This is especially useful in a relatively young field like computational
music analysis, because the ease of the system encourages experimentation with new
feature sets and approaches.
Connectivity in diffusion also has intuitive analogues in music for connecting notes: chords and temporal proximity. This approach will be fundamental to the work to follow; its value is that it allows the connections between the notes to be dictated by musical relationships, so that the resulting visualizations are grounded in the underlying music theory.
Organizing by local geometry extends diffusion's usefulness to databases as well. As is the case with theoretical geometry, structure in collections of songs is unlikely to come from a known and easily described distribution.
Finally, the ease of visualization with diffusion is extremely valuable to music
analysis and organization. First of all, visualization has long been a desired and
sought after tool in music for assistance in analysis, new intuitions, and also simply
to create a multimedia musical experience. Also, musical organization is a difficult
prospect. Music makes a layered and multi-structured database, and, as a result, there
are numerous ways to organize a musical database. The visualizations in diffusion can
serve as a valuable tool in assisted searches. By pulling out structures in a database
and organizing them hierarchically, the user’s task of evaluating the database and
finding the desired organization is made much easier.
The remainder of this work will test the validity of some of these applications.
Diffusion maps will be used to build fundamental geometric representations of mu-
sic theory without any prior knowledge required. Hierarchical clustering in diffusion
space will be used to classify real musical scores based on key and meter. And visu-
alization in diffusion space will be used to represent musical excerpts as trajectories
in a custom space built automatically based on the musical structure itself. Finally,
a machine-learning-based extension for dual diffusion will be proposed as a potential
system for user-guided database organization.
Chapter 4

Diffusion-based Music Theory Analysis
4.1 Introduction
This chapter will focus on creating geometric representations of the fundamentals of music theory using diffusion spaces built only on simple units like intervals and short rhythmic sequences. This will demonstrate, on a basic musical level, the ability
of diffusion geometry to identify and extract the foundational aspects of a tonal or
rhythmic sequence and display it in the corresponding visual space. In all cases, the
input will be the simplest musical representation: binary vectors coding only whether
a particular note is on or off at a particular time.
First, we will begin with tonal-based theory, specifically in the realm of intervals.
Pitch-class intervals (collapsing the entire set of musical notes to the 12 notes within
one octave) can be interpreted in a geometrical context, as seen in several of the
note-based visualizations presented in section 2.4.1. In this case, we will use diffusion
maps to visualize the relationships between notes created by the intervals and the
basic shapes to which those relationships correspond.
By expanding the note set to include multiple octaves, it then becomes possible
to recreate versions of the visualizations from section 2.4.1 directly. These high-level
representations of Western tonal theory can easily be constructed with diffusion maps
by simply selecting the appropriate intervals to connect the notes.
Metrical structures can also be visualized with diffusion maps. Duple and triple
meter beat trains will not only be separated completely, but also organized into the
simple geometric shapes that correspond to the periodicity of the meter itself: a
square and a triangle. The same process can be performed with similar results for
beat trains that more closely resemble hemiolas, in which 3 beats occur in the time
of 2, or vice versa.
Geometric representations of musical relationships can be a valuable tool for un-
derstanding and communicating the implications and properties of those relation-
ships. Diffusion adds an extra level of simplification by automatically identifying and
extracting those representations from the musical relationships themselves.
4.2 Tonal Theory
Two cases of note sets for tonal theory will be examined separately. First, only pitch
classes will be considered. In this context, all intervals between the limited set of
notes can be separately analyzed and explored. Then, an expanded note set that
spans multiple octaves will be used to recreate helical structures that are historically
relevant to music analysis.
4.2.1 Input Representation
These diffusion spaces are created by following the process outlined in section 3.2.
For input, the full set of notes in use for the given experiment np is combined with
the full set of all occurrences of the interval in question iq. To represent the set of
notes combined with the interval i, we will use the notation Xi.
Xi = {..., np−1, np, np+1, ..., iq−1, iq, iq+1, ...}
Both the notes and the intervals are represented as binary vectors with one dimension
corresponding to every note in the note set. The note vectors will each be all zeros
except for a single 1 (indicating which note it represents) and all interval vectors will
have exactly two notes active.
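This input construction can be sketched in a few lines of NumPy (an illustrative helper, not the dissertation's own code): each pitch class is a one-hot vector, and each occurrence of the interval is a two-hot vector.

```python
import numpy as np

def build_interval_dataset(distance, n_notes=12):
    """Pitch classes plus every occurrence of a pitch-class interval,
    all encoded as binary vectors over the 12-note set."""
    notes = np.eye(n_notes, dtype=int)                 # one-hot note vectors
    intervals = np.zeros((n_notes, n_notes), dtype=int)
    for root in range(n_notes):
        intervals[root, root] = 1
        intervals[root, (root + distance) % n_notes] = 1   # wraps around the octave
    return np.vstack([notes, intervals])

X_m2 = build_interval_dataset(1)   # the 12 pitch classes + all 12 minor 2nds
```

For the tritone (distance 6), this construction lists each of the 6 distinct intervals twice; deduplicating those rows is a cosmetic detail.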
4.2.2 Affinity Function
Unless otherwise stated, all experiments utilize the cosine distance affinity function,
shown in Eq. (3.2). This function is well suited here because it defines the similarity
between two data points exclusively by how many notes they share in common relative
to how many notes they could share in common.
k(xm, xn) = (total number of notes xm and xn share in common) / (total number of notes that are active in either xm or xn)
This means that two notes will not have any direct affinity, and so any relationship
they have in the experiments is defined completely by their connectivity through the
intervals.
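Restated for binary vectors, the affinity above is the count of shared active notes divided by the count of notes active in either vector. A small sketch (illustrative names, not the author's code):

```python
import numpy as np

def affinity(x, y):
    """Shared active notes over notes active in either vector
    (the affinity described for Eq. 3.2, applied to binary vectors)."""
    return np.sum(x & y) / np.sum(x | y)

c = np.zeros(12, dtype=int); c[0] = 1          # the note C
d = np.zeros(12, dtype=int); d[2] = 1          # the note D
cd = c | d                                     # the major 2nd between C and D

print(affinity(c, d))    # 0.0: two single notes share nothing directly
print(affinity(c, cd))   # 0.5: a note and an interval containing it
```

The zero note-to-note affinity is exactly why, in these experiments, any relationship between two notes arises only through their connectivity via the intervals.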
4.2.3 Geometric representations of pitch-class intervals
We begin with an atomic unit of tonal music theory: the interval. The interval is the
number of musical steps between two notes, and discussions of the interval usually
involve notes played in succession or simultaneously.
In this section, only pitch-class intervals will be used, and so the note set will
consist exclusively of the 12 pitch classes {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11}. For conve-
nience, for the remainder of this chapter, the pitch classes will be referred to by the
note set where pitch class 0 = C: {C, C♯, D, D♯, E, F, F♯, G, G♯, A, A♯, B}.

Ignoring the octave essentially means that the distance between two notes can be
represented in two ways: the distance between the notes, and the distance outside
the notes (where the pitch-class representation wraps back around). The latter will
be referred to as the inversion. The intervals and their inversions for each distance
are shown in Table 4.1. It can easily be seen that all intervals are represented by the
interval/inversion pairs for distances of 1-6 semitones (distances of 7-11 semitones
simply repeat the intervals with the interval/inversion pairing reversed). Note that,
Semitones   Interval            Inversion
1           minor 2nd (m2)      major 7th (M7)
2           major 2nd (M2)      minor 7th (m7)
3           minor 3rd (m3)      major 6th (M6)
4           major 3rd (M3)      minor 6th (m6)
5           perfect 4th (P4)    perfect 5th (P5)
6           tritone (T)         tritone (T)
7           perfect 5th (P5)    perfect 4th (P4)
8           minor 6th (m6)      major 3rd (M3)
9           major 6th (M6)      minor 3rd (m3)
10          minor 7th (m7)      major 2nd (M2)
11          major 7th (M7)      minor 2nd (m2)

Table 4.1: The pitch-class intervals and their inversions
because of the cyclic nature of the pitch-class set itself, visualizations for interval
distances of 7-11 semitones would exactly replicate those of their inversions in the
1-6 semitone range. As a result, only distances up to 6 semitones will be examined in this section.
The exercise of visualizing the intervals is useful for a few reasons. First of all,
demonstrating that diffusion maps graph the pitch-class set in a geometry meaningful
to the interval itself is a strong first step toward showing the value of diffusion geom-
etry to music theory analysis. Secondly, the visualizations themselves can be a more
direct and straightforward way to understand the intervals and the relationships they
create. In many cases, a simple visual representation can immediately communicate
the musical relationships that it would take paragraphs to explain. Clearly, as the
rich history addressed in section 2.4.1 demonstrates, many scholars in the field have
seen this value in visualizations as well.
For notational ease, all notes with accidentals (e.g. C♯ or D♭) will be referred to
in their sharp form. In many of these cases, standard musical notation would more
often use a flat, but for consistency and notational efficiency, only the sharp will be
used. As an example, stepping up a minor 3rd from C moves to E♭. This note will be
referred to as D♯ for ease, even though E♭ is the proper functional notation, because
both notations refer to the same pitch class.
Figure 4.1: The pitch classes connected by semitone intervals plotted in the first two dimensions of the diffusion map.
1 semitone: Minor 2nd or Major 7th
The minor 2nd connects two notes that are adjacent to each other in the chromatic
scale (e.g. C and C♯). A noteworthy property of the minor 2nd is that it fully
connects the pitch-class set, in that any pitch class can reach any other pitch class
by only taking steps of one semitone. Furthermore, starting at any note and moving
by one semitone in the upward direction will move through every other note before
returning to the starting point (the same, of course, is true with downward stepping
as well).
These two properties are extremely relevant for diffusion. The connectivity of the
interval means that there will only be one stationary distribution for the input set
Xm2, which is built from the pitch classes and semitone intervals. The property of
unidirectional movement passing through all pitch classes and then returning to the
original note would also suggest a cyclic or circular shape would be an appropriate
geometric representation.
Fig. 4.1 shows the first two dimensions of the diffusion map, with the minor 2nd
Figure 4.2: The pitch classes connected by major 2nd intervals plotted in the first three dimensions of the diffusion map.
intervals represented by the lines connecting the notes. The plot clearly takes on a
cyclic shape, as expected, and immediately communicates how the pitch classes are
organized in the context of a semitone: moving in any one direction will pass through
all pitch classes before returning to the starting point.
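The whole pipeline behind this figure can be sketched end to end, following the process of section 3.2: intersection-over-union affinities, row normalization to a Markov matrix, and an eigendecomposition through the symmetric conjugate matrix. This is an illustrative reconstruction, not the author's implementation; by the cyclic symmetry of the input, the 12 note points should land at equal radii on the expected circle.

```python
import numpy as np

def diffusion_map(X):
    """Affinity -> Markov normalization -> eigenvectors, computed via
    the symmetric matrix D^(-1/2) W D^(-1/2) so that eigh applies."""
    Xf = X.astype(float)
    inter = Xf @ Xf.T                                     # shared active notes
    sums = Xf.sum(axis=1)
    W = inter / (sums[:, None] + sums[None, :] - inter)   # intersection / union
    d = W.sum(axis=1)
    vals, vecs = np.linalg.eigh(W / np.sqrt(np.outer(d, d)))
    order = np.argsort(vals)[::-1]
    vals, vecs = vals[order], vecs[:, order]
    return vals, vecs / np.sqrt(d)[:, None]   # right eigenvectors of D^-1 W

notes = np.eye(12)
m2 = np.array([np.roll([1, 1] + [0] * 10, r) for r in range(12)])
vals, phi = diffusion_map(np.vstack([notes, m2]))

emb = vals[1:3] * phi[:, 1:3]              # skip the trivial eigenvector phi_0
radii = np.linalg.norm(emb[:12], axis=1)   # the 12 notes sit on one circle
```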
2 semitones: Major 2nd or Minor 7th
The major 2nd, unlike the semitone, does not fully connect the pitch-class set. Instead,
the notes are broken into two separate and exclusive sets: {C, D, E, F♯, G♯, A♯} and
{C♯, D♯, F, G, A, B}. Within these subsets, however, it is once again the case that
unidirectional movement will pass through all notes (within the subset) before re-
turning to the starting point. The subsets, as they were previously written, show the
order of this path.
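This splitting behavior is purely number-theoretic: stepping by d semitones from any note traces a cycle through 12 / gcd(12, d) pitch classes, so the interval partitions the pitch-class set into gcd(12, d) exclusive subsets. A small sketch:

```python
from math import gcd

def subsets(distance, n_notes=12):
    """Partition the pitch classes into the exclusive cycles reachable
    by repeatedly stepping the given interval distance."""
    seen, groups = set(), []
    for start in range(n_notes):
        if start in seen:
            continue
        cycle, pc = [], start
        while pc not in seen:
            seen.add(pc)
            cycle.append(pc)
            pc = (pc + distance) % n_notes
        groups.append(cycle)
    return groups

for d in range(1, 7):
    assert len(subsets(d)) == gcd(12, d)

print(subsets(2))   # [[0, 2, 4, 6, 8, 10], [1, 3, 5, 7, 9, 11]]
```

This single formula predicts every case in this section: one cycle for the minor 2nd and perfect 4th, two for the major 2nd, three for the minor 3rd, four for the major 3rd, and six for the tritone.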
The existence of two exclusive subsets should be prominently represented in any
visual manifestation of the intervallic relationship. Fig. 4.2 shows the first 3 dimensions
of the diffusion map created for the data set XM2, consisting of the pitch classes and all
Figure 4.3: The pitch classes connected by major 2nd intervals plotted in dimensions 2, 3, and 4 of the diffusion map.
major 2nd intervals. The two subsets have been completely separated. This separation
is seen in the highest ranking dimensions because this aspect of the geometry is the
most fundamental to the representation.
In higher dimensions of the diffusion map (corresponding to higher eigenvectors, or
smaller eigenvalues), the expected cyclical shapes are seen (Fig. 4.3), this time hexag-
onal (since there are six elements in each subset). Interestingly, the two hexagons are
oriented in different directions, further illustrating the separation of the two subsets.
It is necessary to move to higher eigenvectors to see this structure in this shape be-
cause the first two eigenvectors necessarily represent the two stationary distributions,
one for each subset.
The geometric representations communicate very clearly and intuitively that the
two sets are completely separate, and that there is no way to move between the sets
using only the major 2nd interval. At the same time, the proper structure within each
subset is also clearly communicated.
Figure 4.4: The pitch classes connected by minor 3rd intervals plotted in the first three dimensions of the diffusion map.
Figure 4.5: The pitch classes connected by minor 3rd intervals plotted in dimensions 3, 4, and 5 of the diffusion map.
3 semitones: Minor 3rd or Major 6th
The minor 3rd interval also divides the pitch-class set into a group of subsets, this
time three subsets with four elements each: {C, D♯, F♯, A}, {C♯, E, G, A♯}, and
{D, F, G♯, B}. Fig. 4.4 shows clearly that the diffusion space separates these three
subsets with the prominent eigenvectors.
The subsets, which are cyclic within themselves like the previous two examples,
have four elements each, indicating that the proper geometric representation should
be a square for each subset. Fig. 4.5 shows these squares, visualized in dimensions 3,
4, and 5 of the diffusion map. The first two dimensions, plus the eigenvector φ0 which
is excluded from the map (see section 3.2.6 for an explanation), combine to represent
the three stationary distributions that exist due to the three separate subsets within
the pitch-class set.
And, like the hexagons for major 2nd intervals, these squares should vary in their
directional orientation to represent the separation of the subsets. This can be seen
Figure 4.6: The pitch classes connected by minor 3rd intervals plotted in dimensions 3, 4, and 5 of the diffusion map, viewed from a different angle than Fig. 4.5.
Figure 4.7: The pitch classes connected by major 3rd intervals plotted in the first three dimensions of the diffusion map.
somewhat in Fig. 4.5, but a change in angle shows the orientation of the three squares
even more clearly, as seen in Fig. 4.6. Here, it is very clear that the orientation
of the squares represents the exclusivity of the subsets, though the actual relative
orientation of the squares is not entirely obvious, due to the necessary projection
onto a two-dimensional plane for plotting. In the three-dimensional space, the three
squares are oriented similarly to the x-, y-, and z-planes in a Cartesian coordinate
system.
4 semitones: Major 3rd or Minor 6th
The major 3rd interval divides the pitch-class set into four subsets of three elements
each: {C, E, G♯}, {C♯, F, A}, {D, F♯, A♯}, and {D♯, G, B}. As in the previous cases,
these subsets are exclusive of each other, but they are also fully and cyclically con-
nected within the subset.
Because the subsets now have only three elements each, the desired geometrical
Figure 4.8: The pitch classes connected by major 3rd intervals plotted in dimensions 4, 5, and 6 of the diffusion map.
representation would be a series of separate and exclusive triangles.
Fig. 4.7 shows that, once again, the subsets are immediately and clearly separated
in the first few dimensions of the diffusion map. Because there are now four stationary
distributions necessary, one for each subset, we must go all the way to dimensions
4, 5, and 6 of the diffusion map to see the geometrical structure of the intervallic
relationship, seen in Fig. 4.8. Here, four triangles are prominently on display, as
expected.
Unlike the previous examples for the major 2nd and minor 3rd, these triangles are
not oriented as cleanly with relation to each other as the hexagons in Fig. 4.3 or
the squares in Fig. 4.6. This is likely because there are four triangles, or one more
than the three dimensions that we can visualize. It is mathematically impossible to
orient four planes orthogonally in three-dimensional space. As a result, all we can
visualize is some three-dimensional projection of the four-dimensional space in which
these triangles are orthogonal to each other.
Figure 4.9: The pitch classes connected by perfect 4th intervals plotted in the first two dimensions of the diffusion map.
Note that the apparent distortion of some of the triangles (such as that for the
subset {C, E, G♯}, shown in the darkest black) is only because of the angle of
perspective on the space. If viewed from an angle normal to the triangle's plane, all
triangles appear more like the triangle for subset {D♯, G, B}, shown in the lightest gray.
5 semitones: Perfect 4th or Perfect 5th
Like the minor 2nd, the perfect 4th fully connects the pitch-class set. In other words,
starting from any note in the set, any other note can be reached by moving only
through perfect 4th intervals. This means that we should, in concept, have a very
similar geometric representation for the perfect 4th as was seen in Fig. 4.1 for the
minor 2nd, except with the pitch classes located differently to reflect the different
intervallic relationship.
In Fig. 4.9, we see that this is indeed the case in the diffusion space created for the
data set XP4. Instead of creating a chromatic circle, for this set, the representation is
the circle of fifths, a commonly used orientation in music theory. Because the graph
Figure 4.10: The pitch classes connected by tritone intervals plotted in the first three dimensions of the diffusion map.
is fully connected, this graph is displayed by the first two dimensions of the diffusion
map, demonstrating its prominence.
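The circle-of-fifths ordering in Fig. 4.9 follows from the full connectivity of the 5-semitone step: repeatedly stepping by a perfect 4th (or its inversion, the perfect 5th, which traces the same circle in the opposite direction) visits every pitch class before returning home. Sketch:

```python
def step_cycle(distance, start=0, n_notes=12):
    """Note order obtained by repeatedly stepping an interval upward."""
    order, pc = [], start
    for _ in range(n_notes):
        order.append(pc)
        pc = (pc + distance) % n_notes
    return order

NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
print([NAMES[pc] for pc in step_cycle(7)])
# ['C', 'G', 'D', 'A', 'E', 'B', 'F#', 'C#', 'G#', 'D#', 'A#', 'F']
```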
6 semitones: Tritone
The tritone is a critical interval for Western music. The tension created by a tritone,
and the movement toward resolution of that tension, is one of the primary driving
forces in harmonic progressions. However, despite this significant musical role, the
underlying geometry is far less exciting. The tritone breaks the pitch-class set into six
separate subsets: {C, F♯}, {C♯, G}, {D, G♯}, {D♯, A}, {E, A♯}, and {F, B}. Because
there are only two elements in the subsets, they necessarily should create lines in the
geometric representation.
In Fig. 4.10, the diffusion space for this case is shown. Clearly, the subsets are all
divided, and the lines create the expected, albeit unexciting, geometric representation.
Figure 4.11: Several geometric representations of intervals appear in the diffusion space created with the major chord. (a) Minor 3rd squares; (b) Major 3rd triangles; (c) Perfect 4th circle (circle of fifths).
Major Triad
To show how these interval representations appear in more complex musical struc-
tures, let us use a major triad as an example. The intervallic content of a major triad
(e.g., C–E–G) includes a major 3rd, a minor 3rd, and a perfect 5th. By creating a
data set consisting of the pitch-class set and all 12 major chords, we are essentially
combining the data sets from the experiments for those three intervals.
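Building that combined data set is a one-line extension of the interval construction: each major triad is a three-hot vector with active notes at the root, 4 semitones up, and 7 semitones up (illustrative code, not the author's):

```python
import numpy as np

def major_triads(n_notes=12):
    """Binary vectors for the 12 major triads: root, major 3rd (+4
    semitones), and perfect 5th (+7 semitones) above the root."""
    triads = np.zeros((n_notes, n_notes), dtype=int)
    for root in range(n_notes):
        for offset in (0, 4, 7):
            triads[root, (root + offset) % n_notes] = 1
    return triads

# the pitch-class set plus all 12 major chords
X_triads = np.vstack([np.eye(12, dtype=int), major_triads()])
```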
Fig. 4.11 shows a few of the diffusion space dimensions for this data set, and
all three expected geometric shapes can easily be found: the minor 3rd squares (Fig.
4.11(a)), the major 3rd triangles (Fig. 4.11(b)), and the perfect 4th circle.
An interesting observation that can be made from these shapes is how the inter-
play between the major and minor 3rd is represented. Notice that, in Fig. 4.11(a),
the minor 3rd squares are oriented in such a way that the major 3rd subsets are
represented in the columns formed by their corners. The opposite is true for the ma-
jor 3rd triangles in Fig. 4.11(b), where their corners create columns of the minor 3rd
subsets. In this way, even the relationship between the two intervals is represented
in the diffusion space.
And so, through this example, we see that the intervallic relationships present
themselves in the geometry of musical elements that are built from those intervals,
and that an additional layer of information is added by linking the intervals together
in a meaningful way.
4.2.4 Recreating note-based visualizations
Now that we have established the effectiveness of diffusion for creating simple
geometric visualizations of musical intervals, we can try to create some more
sophisticated representations. In section 2.4.1, several historical visualizations of the
notes are reviewed and discussed. These visualization schemes are the result of careful
thought and work by their creators with the intended goal of graphically representing
important relationships in music theory. Yet, with diffusion, we can easily create
these same visualizations in diffusion space with only a few intervals. In a sense,
many historical visualizations are special-case diffusion spaces derived from the spe-
cific intervals considered essential to those organizations.
However, in order to accomplish this, one issue must be addressed. Many of
the visualizations previously discussed require a single pitch class to occupy multiple
spaces at the same time. In a theoretical representation, this is not a problem, because
it can simply be accepted that each location is equal to the others, and any of the
infinite repetitions can be arbitrarily selected. However, in diffusion space, this is not
possible. Each data point is located based on the output of the eigenfunctions. These
eigenfunctions necessarily can only give one output for a single input, and therefore
Figure 4.12: Shepard's chromatic helix in diffusion space, resulting from the combination of minor 2nd and octave intervals with the full note set, with and without the minor 2nd intervals drawn in. (a) Pitch classes only; (b) minor 2nd intervals connected.
a note in diffusion space cannot simultaneously occupy multiple locations.
As was previously discussed, the use of helices and spirals in music theory
geometry was intended to alleviate this issue. Chew's Spiral Array [13], for example,
is the Tonnetz [27, 57] wound into a spiral. So, we would expect to get a similar
representation in diffusion space.
However, while the use of helices and spirals removes the redundancy in the polar
dimension that moves around the structure, it does not remove redundancies up and
down the helices. We will address this problem here by including octaves in the
note set. In the case of Shepard’s chromatic helix and double helix [69], octave, or
pitch height, is already included in the representation, so this addition is consistent
with the original work. However, in the case of Chew’s Spiral Array, the space
described does not include pitch height, so that will be added artificially for the
diffusion representation to avoid the intractable problem of placing the same object
in multiple locations.
Figure 4.13: Zooming in on two octaves of the chromatic helix from Fig. 4.12(b).
Shepard’s Chromatic Helix
Shepard’s chromatic helix is fundamentally built on two intervals. The first is the
minor 2nd, which is the interval that steps around the helix. The second interval is
the octave, which establishes the period of a cycle around the helix. As seen in
Shepard’s original figure, the chromatic helix lines up the octaves vertically. In the
representation, this alignment also means that collapsing down the octaves projects
onto the chromatic circle (which was the diffusion-derived representation for the
minor 2nd in Fig. 4.1).
So, by organizing the notes in the note set (built with multiple octaves), we find
the organization in Fig. 4.12(a), where the notes are organized into some sort of
cylindrical shape. The vertical axis also clearly separates based on pitch height.
However, it is difficult to visually diagnose the representation in any finer detail.
To assist in understanding, Fig. 4.12(b) connects the minor 2nd intervals. This
clarifies the picture a great deal, as the intervals form an ascending helix around
the outside. By simply using the minor 2nd and octave for an organization, we have
created Shepard’s chromatic helix. For easier visualization of this, Fig. 4.13 zooms
in on two octaves from the middle of the helix. In this zoomed view, it is easy to see
that the helix is beautifully recreated.
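Constructing the input behind Figs. 4.12 and 4.13 only requires expanding the note set across several octaves and adding two interval families, minor 2nds and octaves; unlike the pitch-class experiments, nothing wraps around. A sketch (the octave count here is an arbitrary choice):

```python
import numpy as np

def helix_dataset(n_octaves=5):
    """Note set spanning several octaves plus the two interval types
    behind Shepard's chromatic helix: minor 2nds (adjacent notes) and
    octaves (notes 12 semitones apart). No octave wraparound."""
    n = 12 * n_octaves
    rows = [np.eye(n, dtype=int)]              # one-hot note vectors
    for gap in (1, 12):                        # minor 2nd, then octave
        for lo in range(n - gap):
            v = np.zeros(n, dtype=int)
            v[lo] = v[lo + gap] = 1
            rows.append(v)
    return np.vstack(rows)

X = helix_dataset()
# 60 notes + 59 minor 2nds + 48 octaves = 167 rows
```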
This approach also confirms that Shepard’s proposed visualization is the fun-
damental representation for a musical system based on the minor 2nd and octave.
However, Shepard clearly recognized that this wasn’t the only way to view the rela-
tionships of the notes, as he also proposed the double helix.
Shepard’s Double Helix
Shepard’s double helix is based on the perfect 5th, major 2nd, and octave. The octave
once again establishes the period of a helical cycle and defines the vertical axis. The
major 2nd is fundamental to each of the two helices, as that is the interval that each
covers with each step. And the representation is designed so that collapsing the octave
projects the helices onto the circle of fifths (establishing the importance of the perfect
5th).
To create this representation in diffusion space, the octave is obviously necessary.
Figure 4.14: Shepard's double helix in diffusion space, resulting from the combination of perfect 5th and octave intervals with the full note set, with and without the major 2nd intervals drawn in. (a) Pitch classes only; (b) major 2nd intervals connected.
The perfect 5th also must be included, it would seem, in order to establish the proper
circular orientation (as was the case with the chromatic helix and the minor 2nd).
However, as it turns out, the major 2nd is unnecessary to create the visualization.
The implications of this will be discussed shortly.
The diffusion space created with the octave and the perfect 5th is shown in Fig.
4.14(a). The major 2nd intervals are connected in Fig. 4.14(b), visualizing the double
helix shape that is created. To more clearly see it, the representation is once again
zoomed to two octaves in Fig. 4.15. Notice that the “helices” could more accurately
be described as hexagons (the fundamental geometric representation of the major 2nd,
shown in diffusion space in Fig. 4.3) stretched in the vertical dimension.
However, it is important to consider that we were able to construct this shape
without including the major 2nd, which, in Shepard’s original formulation, played a
central role in the representation. The reason for this is that the double helix is not
essential to creating this organization of the notes. In fact, if the perfect 5th were re-
placed with the major 2nd, the two helices would exist completely separately, because
there would be no relationship between them. If the helices are to be intertwined, as
is the case in Shepard’s original representation, then the major 2nd is insufficient for
Figure 4.15: Zooming in on two octaves of the double helix from Fig. 4.14(b).
constructing the proper note organization. Rather, it is simply a way to interpret it.
In fact, this organization of the notes can be interpreted in several ways. A few of
these interpretations are shown in Fig. 4.16. We have already seen that it can be viewed
as two intertwined helices ascending by major 2nd intervals. However, it is also possible
to view it as 5 separate helices ascending by perfect 4th intervals (Fig. 4.16(a)), or 7
separate helices ascending by perfect 5th intervals (Fig. 4.16(b)). It is even possible
to interpret the space as 12 helices ascending by octaves (Fig. 4.16(c)).
So, we can now see that Shepard’s double helix is really an interpretation of a
note organization, which is itself based on the octave and perfect 5th interval. And it
was through diffusion analysis that we learned this distinction.
Chew’s Spiral Array
Recreating Chew’s Spiral Array is slightly more challenging. This is because the
pitch height (which must be included to constrain one note to one location) does not
incorporate quite as cleanly as it does in Shepard’s helical representations. However,
the proper intervals can be derived with a little understanding of the space in which
the Spiral Array exists.
First of all, the spiral ascends by perfect 5th intervals, with four steps per cycle.
In the array, this leads to a vertical jump by a major 3rd. However, in our space, in
which octave must be accounted for, this vertical jump actually must be by a major
3rd plus two octaves (which is 28 semitones, or four steps of 7 semitones). So, in the
note organization of the Spiral Array adjusted to include pitch height, the vertical
interval is a major 3rd plus two octaves.
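The vertical interval falls out of simple arithmetic: four perfect-5th steps per turn of the spiral total 28 semitones, which is exactly a major 3rd raised by two octaves.

```python
PERFECT_5TH, OCTAVE, MAJOR_3RD = 7, 12, 4    # sizes in semitones

four_steps = 4 * PERFECT_5TH                 # one full turn of the spiral: 28 semitones
assert four_steps == MAJOR_3RD + 2 * OCTAVE  # = 4 + 24, a major 3rd plus two octaves
```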
It seems obvious that the second fundamental interval would be the perfect 5th,
since that is the unit for the spiral’s ascension. However, as it turns out, this is not
the correct approach. When the pitch height is added, the note space in which the
Spiral Array exists is actually broken into 7 Spiral Arrays. This is because, as was
the case with the major 2nd in Shepard’s double helix, the combination of the perfect
5th and major 3rd plus two octaves does not fully connect the note set. So, with the
perfect 5th interval as the building block, the note set would be broken into exclusive
subsets.
Figure 4.16: Several other interpretations of the note organization in which Shepard's double helix exists. (a) Perfect 4th helices; (b) Perfect 5th helices; (c) Octave helices.
Figure 4.17: The diffusion space created with minor 2nd and two-octave-plus-major-3rd intervals, with an approximation of the Spiral Array represented by the lines.
Figure 4.18: The Krumhansl-Kessler key space from Fig. 2.6(a) remade with diffusion. Dots represent major keys and circles represent minor keys. (a) φ1 and φ2; (b) φ3, φ4, and φ5.
So, like in the case of Shepard’s double helix, we must instead construct a space
in which the Spiral Array can exist. This can be accomplished with the minor 2nd,
creating one external spiral that ascends chromatically. By adding in the major 3rd
plus two octaves to align the spiral vertically, a space is created in which all 7 Spiral
Arrays can exist intertwined with each other. Fig. 4.17 shows this note space with
one of the Spiral Arrays drawn in by connecting the perfect 5th intervals, starting at
an arbitrary note. The spiral appears as a square because a cycle in the array only
takes four steps.
In this case, the addition of the pitch height complicated the formulation of Chew’s
Spiral Array, which is quite elegant in the pitch-class framework. However, even with
the added complication, the visualization can still be easily discerned and more deeply
understood through the diffusion formulation.
Krumhansl and Kessler’s Key Space
As an exercise, Krumhansl and Kessler [46] used Multi-Dimensional Scaling (MDS)
to create a low dimensional visualization for the keys based on the K-K key profiles.
This key space can be seen in Fig. 2.6(a).
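The diffusion version of this experiment starts from the 24 K-K key profiles: each key is a 12-dimensional probe-tone rating vector obtained by rotating the major or minor profile to the key's tonic. The sketch below uses the commonly quoted profile values (they should be checked against Krumhansl and Kessler's published table) and a true cosine affinity, since the data are no longer binary:

```python
import numpy as np

# K-K probe-tone profiles as commonly tabulated (verify against the source)
MAJOR = np.array([6.35, 2.23, 3.48, 2.33, 4.38, 4.09,
                  2.52, 5.19, 2.39, 3.66, 2.29, 2.88])
MINOR = np.array([6.33, 2.68, 3.52, 5.38, 2.60, 3.53,
                  2.54, 4.75, 3.98, 2.69, 3.34, 3.17])

# 24 key profiles: each key is the base profile rotated to its tonic
keys = np.vstack([np.roll(MAJOR, k) for k in range(12)] +
                 [np.roll(MINOR, k) for k in range(12)])

# cosine affinity between key profiles (the binary intersection-over-union
# form no longer applies to real-valued ratings)
norms = np.linalg.norm(keys, axis=1)
W = (keys @ keys.T) / np.outer(norms, norms)
```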
In Fig. 4.18, the same low dimensional spaces are shown in the first several
Figure 4.19: Duple-meter beat trains separated completely from triple-meter beat trains and organized into a square.
dimensions of the diffusion maps for the K-K key profiles. The first two dimensions
(Fig. 4.18(a)) form the exact same space in which majors and minors separated by a
major 2nd are grouped together in perfect 5th circles. The second grouping occurs in
the next three dimensions, where keys separated by major 3rd intervals are grouped
together, with each group consisting of only major or minor keys.
4.3 Metrical Structure
Geometric representations for metrical structure have not seen the same attention as
their tonal counterparts. However, metric components of music are equally relevant
to music understanding, and, as will be seen, we can organize them with diffusion in
equally insightful and interesting ways.
Figure 4.20: Triple-meter beat trains separated completely from duple-meter beat trains and organized into a triangle.
4.3.1 Metric geometry
The first metric experiment compares beat trains based on different meters. For this,
the data set is built of a group of beat trains. The affinity between two beat trains
is defined by the number of downbeats that overlap when the two trains are aligned
beat-for-beat.
The beat trains themselves are built on periodicities of 2, 3, 4, 6, 8, or 9 beats.
This gives a good representation for duple (2, 4, and 8 beat periodicities) and triple
(3, 6, and 9 beat periodicities) meters.
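This construction can be sketched directly in code. The binary encoding and the 72-beat common length below are illustrative assumptions (72 is divisible by every periodicity used), not the dissertation's exact implementation:

```python
import numpy as np

def beat_train(period, length=72):
    """A binary beat train with a downbeat every `period` beats. A length
    of 72 is divisible by 2, 3, 4, 6, 8, and 9, so every train completes
    a whole number of cycles."""
    train = np.zeros(length, dtype=int)
    train[::period] = 1
    return train

def affinity(a, b):
    """Number of downbeats that coincide when two trains are aligned beat-for-beat."""
    return int(np.sum(a * b))
```

Under this encoding, trains within the duple family share many downbeats (every downbeat of the 4-train lands on the 2-train), while duple and triple trains coincide only where the least common multiple of their periods divides the position.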
By building a graph on these beat trains with the affinity described above, several
interesting results are found. The first is that the duple and triple meters are com-
pletely separated in the eigenvectors. This is to say that there are several eigenvectors
where all duple meter beat trains have a non-zero value while all triple meter beat
trains have a zero value, and other eigenvectors where the opposite is true. These
eigenvectors could be used for metric classification.
This revelation alone is quite significant. The system used was given no clues as
to what is significant about these beat trains or how they can be organized or dis-
tinguished. The only information used was how many downbeats overlap. However,
through the connectivity-based analysis metric of diffusion, the two metric founda-
tions are separated completely.
How they are separated is also quite interesting. The first two eigenvectors that
identify duple meter are shown in Fig. 4.19. They clearly organize the duple meter
beat trains into a square. In Fig. 4.20, the first two eigenvectors in which the triple
meter beat trains are organized are seen to orient them into a triangle. These two
shapes are quite interesting, because they geometrically represent the mathematical
foundation of the metrical systems themselves.
Triple meter is based on periodic units of three, which geometrically most obvi-
ously correlates with a triangle. Duple meter is based on periodic units of two, though
the most common form is based on units of four, which is most closely connected to
the geometric square.
This is a very impressive example of the organizational abilities of diffusion mapping. By presenting the analysis system with only the number of overlapping downbeats for beat trains, the beat trains are separated into metrical units and then organized into shapes that embody the metrical foundation on which they were built. It shows the power of diffusion as an analytic tool for music theory that it is possible to extract all of this without incorporating any prior knowledge of the data into the analysis.
4.3.2 Visualizing hemiolas
A similar experiment can be repeated with a hemiola-based data set. In this example,
the data once again consists of beat trains that are built on periodicities of 2, 3, 4,
6, or 8 beats (9 beat periodicities were excluded in this case). And, the affinity was
once again defined by the number of overlapping downbeats shared in common by
two beat trains.
However, the key difference between this experiment and the previous metrical
experiment is that here all beat trains were considered to be one measure long. In
the metrical case, the beats were aligned, and so this meant that measures were not
necessarily aligned. In this case, measures are aligned, which will mean that beats will
not necessarily align. This experiment is hemiola-based because it compares different
ways of breaking down a single measure into a varying number of units.
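The measure-aligned variant can be sketched the same way. Here the 24-position measure grid is an assumed convenience (24 is divisible by every division used), not the dissertation's exact choice:

```python
import numpy as np

MEASURE = 24  # divisible by 2, 3, 4, 6, and 8, so every division lands on the grid

def hemiola_train(divisions):
    """One measure divided into `divisions` equal units, 1 marking each unit onset."""
    train = np.zeros(MEASURE, dtype=int)
    train[:: MEASURE // divisions] = 1
    return train

def affinity(a, b):
    # Measures, not beats, are aligned: compare position-for-position
    # within the single shared measure.
    return int(np.sum(a * b))
```

Note that a 2-division and a 3-division of the measure now share only the measure downbeat itself, which is the essence of the hemiola comparison.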
We see in Figs. 4.21 and 4.22 that, despite this different approach to aligning the
beat trains, the results are quite similar. Once again, the beat trains that divide the
measure into units based in 2 are organized into a square, and those that divide the
measure into units based on 3 are in a triangle.
One noteworthy difference between the metric case and these hemiola results is that, in the metric case, the duple-meter beat trains were all located at (0, 0) in the triple-meter organizations, and vice versa for the duple-meter organizations. In these hemiola graphs, the opposite is true: the shape of one type of beat train forces the other type to its external corners rather than into the center. The reasons for this difference are not entirely clear.
Figure 4.21: Hemiolas based in units of 2 shaped into a square, similarly to the metric case shown in Fig. 4.19.
Figure 4.22: Hemiolas based in units of 3 shaped into a triangle, similarly to the metric case shown in Fig. 4.20.
4.4 Conclusion
In this chapter, we examined the diffusion maps created for several atomic units of music theory. By starting with the simplest and most fundamental of these units, we are able to build a foundation for understanding musical relationships with diffusion.
First, the geometry created with the intervallic relationships within the pitch-class
set was graphed. With lessons learned from these initial experiments, it then became
possible to recreate significant historical note-based visualizations using only a few
simple intervals. This not only gave insight into the process of diffusion mapping for
music, but also into the visualizations themselves. Finally, we looked at rhythmic
organizations, where it was shown that diffusion maps can extract high-level musical
information from even the simplest of inputs.
While these examples, by design, limit themselves to simple musical concepts, the
tasks on display are by no means trivial. Organizing intervals into their fundamen-
tal geometry (with that geometry hierarchically organized into the most prominent
eigenvectors) is a valuable tool. Completely and automatically separating beat trains
into their metrical units is another significant accomplishment. However, in all of
these cases, the most promising aspect is the ability to extract and communicate fun-
damental insights into the mathematical patterns behind the music theory. This is an
extremely interesting prospect, and shows that there is great potential for diffusion-
based analysis in the world of music theory.
Chapter 5
Diffusion-based Musical
Applications
5.1 Introduction
The visualizing and analytic capabilities of diffusion mapping lend themselves to inter-
esting applications in computational music analysis and music information retrieval.
Here, we will examine the role diffusion can play in enhancing key finding, meter
induction, and music visualization.
During the course of this exploration, several machine learning algorithms will
be demonstrated in diffusion space, showing that this approach provides an effective
front-end addition to high-level machine learning tasks.
5.2 Key Finding
Defining the key of a musical excerpt, whether in symbolic form or in an audio clip, is a highly sought-after task. Teaching a computer to understand musical key is a profound first step towards teaching a computer to understand music in general.
With diffusion, it is possible to create distributional or functional key-finding
algorithms, depending on what sort of features are used for relating the musical
excerpts. And, in fact, we will first show that the fundamental bases of some of the
CHAPTER 5. DIFFUSION-BASED MUSICAL APPLICATIONS 89
most popular algorithms for each approach can be easily derived from the diffusion
time constant.
For all key-finding examples, Bach’s The Well-Tempered Clavier, Books 1 and
2, are used. This database was selected for several reasons. First of all, it gives
nearly equal treatment to all keys, with two preludes and two fugues in each key.
Also, the even distribution of fugues and preludes offers a good mix of harmonic
styles. Finally, these pieces, and much of Bach’s music in general, provide excellent
examples of Western music theory in real scores with a master’s adherence to the
harmonic rules while still offering a great deal of variety.
5.2.1 Key-Finding Characteristics from the Diffusion Time
Constant
This first experiment tries to extract some understanding of key in Western music by
using the full database to organize the 12 pitch classes in a diffusion space.
First, the whole database is transposed into the same key and separated into major
and minor subsets. Then, all unique combinations of notes (intervals and chords) are
extracted and counted for each subset. Using this information, the twelve pitch-
classes are organized into two separate diffusion spaces, one for major keys and one
for minor keys. This process is very similar to the method used to create geometric
interval representations in Section 4.2.3.
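The diffusion-space construction follows the standard diffusion-map recipe (affinity matrix, random-walk normalization, eigendecomposition). A minimal sketch, with all parameter choices illustrative rather than the dissertation's exact implementation:

```python
import numpy as np

def diffusion_map(W, n_dims=3, t=1):
    """Minimal diffusion-map sketch from a symmetric affinity matrix W.

    The rows of W are normalized into a random-walk transition matrix; its
    non-trivial eigenvectors, scaled by eigenvalues raised to the diffusion
    time t, give the embedding coordinates."""
    d = W.sum(axis=1)
    # The symmetric conjugate D^{-1/2} W D^{-1/2} shares eigenvalues with
    # the random walk P = D^{-1} W, so a Hermitian solver can be used.
    S = W / np.sqrt(np.outer(d, d))
    vals, vecs = np.linalg.eigh(S)
    order = np.argsort(vals)[::-1]            # largest eigenvalue first
    vals, vecs = vals[order], vecs[:, order]
    psi = vecs / np.sqrt(d)[:, None]          # right eigenvectors of P
    # Skip the trivial pair (eigenvalue 1, constant eigenvector).
    return (vals[1:n_dims + 1] ** t) * psi[:, 1:n_dims + 1]
```

On a small affinity matrix with two loosely connected groups, the first embedding dimension separates the groups, mirroring how strongly connected pitch classes cluster together in the spaces described here.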
The diffusion time constant matrix τ for these two organizations is shown in Fig.
5.1. Analyzed in this form, there isn’t too much to be concluded or learned. It is
clear that the notes 1 and 3 semitones (minor 2nd and minor 3rd, respectively) from
the tonic are furthest removed from the other pitch classes in the major key (Fig.
5.1(a)). 6, 8, and 10 are also somewhat distanced, and this now leaves us with only
the major scale.
In the minor subset (Fig. 5.1(b)), 1 and 6 are most removed, with 4, 9, and, to a lesser extent, 11 removed, this time deriving the minor scale.
It is noteworthy that the least removed pitch classes of those outside the scale (6 for major and 11 for minor) each serve a relevant role in their respective modes. In the
Figure 5.1: Diffusion time constants between the pitch classes for major and minor keys. (a) Major. (b) Minor.
major case, the tritone of the tonic (6 semitones away) often appears in the secondary
dominant. And, in minor modes, the leading tone (11 semitones above the tonic or
one below) is often used in cadences and other moments when a particularly strong
movement to the tonic is desired.
So, some musical understanding can be gleaned from the time constants in the
matrix view. However, plotting this same information in a few slightly different
contexts can yield much deeper insight.
Rare-interval interpretations
Fig. 5.2 shows the relative time constants for different intervals in major keys. So, in
Fig. 5.2(a), the time constants between all pitch classes with only 1 semitone between
them are shown. The labels show the root of the interval, meaning, for example, that
the bar labeled 2 shows the time constant between pitch class 2 and the pitch class
one interval up from 2. The corresponding plots for the minor keys can be seen in
Fig. 5.3.
These plots tell a much more interesting story regarding the keys, especially in the
context of rare intervals and functional key finding. A rare interval, generally, is one
Figure 5.2: Diffusion time constants between notes separated by various intervals for the major subset. (a) Minor 2nd. (b) Major 2nd. (c) Minor 3rd. (d) Major 3rd. (e) Perfect 4th. (f) Tritone.
Figure 5.3: Diffusion time constants between notes separated by various intervals for the minor subset. (a) Minor 2nd. (b) Major 2nd. (c) Minor 3rd. (d) Major 3rd. (e) Perfect 4th. (f) Tritone.
that occurs only in specific key-dependent situations and therefore can be used in key
identification. In these diffusion time constant plots, a rare interval could be defined
as one that has a few small time constants paired with mostly large time constants,
indicating that the interval has only a few situations where it is highly relevant to the
key and otherwise is highly unusual. Looking at the plots in Figs. 5.2 and 5.3 with this criterion, we can draw some interesting conclusions.
First, the tritone, which was one of the motivations for the rare-interval theory,
is not particularly effective. The tritones with small diffusion time constants are not drastically different from the others, and, more importantly, the tritone is cyclic in 6
semitones, meaning the diffusion time constants repeat in the plot. This is intuitively
obvious, since the tritone bisects the octave, so, in the pitch-class set, the tritone
{B,F} is identical to the tritone {F,B}. Unfortunately, this means there are only
6 tritone intervals possible, and the tritone in C major is indistinguishable from the
tritone in F♯ major (not to mention c minor and f♯ minor). Additionally, the relative
locations of the characteristic intervals in the plots for the major and minor keys are
separated by a minor 3rd, suggesting there may also be some confusion between the
relative major and minor keys. This analysis suggests that the tritone may not be
the best interval for rare-interval key finding.
Another interval that, at first glance, shows potential for key finding is the major
3rd. In both the major and minor subsets, there are only three intervals with small
diffusion time constants, and the rest are quite a bit larger. However, comparing the
case for major keys (Fig. 5.2(d)) with minor keys (Fig. 5.3(d)) reveals that the two
are very similar, offset by a minor 3rd. This suggests that the major 3rd interval would
struggle to distinguish relative major and minor keys. However, the potential for the
interval to distinguish all other keys promotes it to an interval worth investigating.
The final interval that appears to have potential for rare-interval key finding,
according to these plots, is the minor 2nd. In both major and minor keys, the intervals
common to the key are separated by a non-tritone distance, so no two keys will have
the same profile. However, like the major 3rd, comparing the major and minor keys
reveals that, in both cases, the small diffusion time constants are oriented the same
distance apart. Once again, the relative major and minor keys will likely be confused.
Figure 5.4: Key profiles derived from the diffusion time constant compared to the K-K key profiles for major (top) and minor (bottom) keys.
The possibility of using the minor 2nd for rare-interval key finding has previously been
suggested as well [44].
All of these intervals show promise but also potential drawbacks. We will examine
their effectiveness for key finding in diffusion space in Section 5.2.2.
K-K key profile interpretations
We can also compare the first row of the τ matrices to the K-K key profiles, since, in
this context, these two metrics are trying to measure the same relationship between
the pitch classes and the tonic. In order to put these two metrics on the same
scale, we will actually compare the K-K profiles to the negative exponential of the
time constant e−τ . This inverts the time constant’s orientation, so larger is a more
significant relationship and smaller is less significant, as was the case for the K-K key
profiles.
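The rescaling is straightforward. The sketch below assumes the K-K major profile values as published by Krumhansl and Kessler, with a τ row supplied by the diffusion analysis; the function name is illustrative:

```python
import numpy as np

# K-K major key profile (Krumhansl & Kessler), indexed by semitones above the tonic.
KK_MAJOR = np.array([6.35, 2.23, 3.48, 2.33, 4.38, 4.09,
                     2.52, 5.19, 2.39, 3.66, 2.29, 2.88])

def comparable_profiles(tau_row, kk_profile):
    """Put a row of diffusion time constants and a K-K profile on one scale:
    e^{-tau} flips the time constant's orientation (large = closely related),
    and both vectors are normalized to a maximum of 1."""
    diffusion_profile = np.exp(-np.asarray(tau_row, dtype=float))
    return (diffusion_profile / diffusion_profile.max(),
            kk_profile / kk_profile.max())
```

The two normalized vectors can then be plotted side by side, as in Fig. 5.4.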
The comparison of these two metrics can be seen in Fig. 5.4 (with the K-K key
profiles normalized to a maximum of 1 for comparison). The two are astoundingly
similar, showing identical relationships with only small scaling differences between
the two. This demonstrates that the diffusion space created by the database and its
organization of the notes actually corresponds almost identically to perceptual data.
By no means is this a suggestion that the diffusion process is in any way related to
perception, but it is highly relevant that the two create the same hierarchy for the
pitch classes.
A more likely conclusion is that both the K-K key profiles and the diffusion time
constant provide accurate representations of a fundamental pitch hierarchy that exists
in the perceptual system and therefore has guided much of Western music theory as
well.
5.2.2 Functional Code-Based Key Organization
We now circle back to experimentally examine the conclusions drawn on rare intervals
in Section 5.2.1. Using the diffusion time constants, several suggestions were made
about the viability of the tritone, major 3rd, and minor 2nd in key-finding applications.
Here, we will test these intervals plus a few other functional characteristics of music.
Method
These experiments are conducted on the Bach database, though all works are trans-
posed into all keys to fully use the data. Each track is represented by the number
of occurrences of the interval or set of intervals used in the particular experiment.
The intervals are separately counted for simultaneous occurrences, steps up by the
interval, and steps down by the interval. These counts are then used to create an
organization of the pieces in a diffusion space, and the diffusion time constant is
subsequently calculated.
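The three-way interval count might be sketched as follows; the flat (onset, pitch) score representation and the adjacent-pair traversal are simplifying assumptions, not the dissertation's exact feature extraction:

```python
def interval_counts(notes, interval):
    """Count a single interval's occurrences three ways: sounding
    simultaneously, stepping up, and stepping down.

    `notes` is a hypothetical flattened score: a time-ordered list of
    (onset, pitch) pairs. Only adjacent pairs are inspected, a
    simplification of a full score traversal."""
    simultaneous = up = down = 0
    for (t1, p1), (t2, p2) in zip(notes, notes[1:]):
        step = p2 - p1
        if t2 == t1:
            # Same onset: a harmonic (simultaneous) interval.
            if abs(step) % 12 == interval:
                simultaneous += 1
        elif step > 0 and step % 12 == interval:
            up += 1
        elif step < 0 and (-step) % 12 == interval:
            down += 1
    return simultaneous, up, down
```

The triple (simultaneous, up, down) for each tested interval then forms the feature vector for a piece.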
In order to evaluate the key-finding capabilities, 90% of the data is randomly
assigned as training data (using the same set for all tests), and the keys for the other 10% of the data are determined using a k-nearest neighbors algorithm, defining the neighbors as those closest in the diffusion time constant.
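The evaluation step can be sketched as a k-nearest-neighbor vote over the precomputed τ matrix; the helper name and the choice of k below are illustrative:

```python
import numpy as np
from collections import Counter

def knn_key_labels(tau, labels, train_idx, test_idx, k=5):
    """Label test pieces by a k-nearest-neighbor vote, with neighborhoods
    defined by the diffusion time constant matrix `tau` (small = close)."""
    predictions = []
    for i in test_idx:
        # Rank training pieces by their time constant to the test piece.
        nearest = sorted(train_idx, key=lambda j: tau[i, j])[:k]
        votes = Counter(labels[j] for j in nearest)
        predictions.append(votes.most_common(1)[0][0])
    return predictions
```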
Method      Accuracy
Tritone     55.65%
Major 3rd   76.52%
Minor 2nd   73.04%
Dominant    73.91%
Combo       91.30%
Harmonic    93.91%

Table 5.1: Accuracy for various interval-based key-finding approaches using nearest neighbors in the diffusion time constant.
The tritone, major 3rd, and minor 2nd were all tested separately. A dominant-
based interval relationship was also tested on its own, in which the scores are searched
for occurrences of a descent by a perfect 5th, which is the root progression of the
dominant cadence. Then, all four of these were combined for one test. Finally,
a full harmonic examination was performed, counting the occurrences of all note
combinations simultaneously (in this case, temporal progression was not used).
Results
The results for all of the experiments, which can be seen in Table 5.1, show that all
methods work reasonably well, since the random threshold for 24 classes is 4.17%, and
all methods perform significantly above this. The relative results also support much of
the analysis from Section 5.2.1. First of all, the tritone gives the worst performance,
which was expected based on the potential confusion between parallel majors and
minors as well as keys separated by a tritone. Major 3rd and minor 2nd approaches
performed significantly better, which was also expected, since only relative major and
minor key confusion appears fundamentally problematic for the intervals.
The confusions for these methods can be seen in Fig. 5.5, where incorrect estimates
are shown for the major and minor pieces with important types of mistakes labeled.
The corresponding diffusion maps (in the predominant dimensions) are shown in Fig.
5.6. In the diffusion maps, it is possible to see why the common confusions for each
approach occur.
Figure 5.5: Confusions for all 6 functional key-finding experiments with noteworthy confusions labeled. (a) Tritone. (b) Major 3rd. (c) Minor 2nd. (d) Dominant. (e) Combo. (f) Harmonic.
Figure 5.6: Diffusion maps for all 6 functional key-finding experiments. (a) Tritone. (b) Major 3rd. (c) Minor 2nd. (d) Dominant. (e) Combo. (f) Harmonic.
In the tritone case, the errors are mostly distributed, though there are peaks at
both tritone intervals, as expected. This confusion trend, seen in Fig. 5.5(a), is
reinforced in the diffusion map (Fig. 5.6(a)) in which only 6 of the keys can even be
seen. This is because the confusion between tritone-separated keys is so significant
that the keys one tritone above (or below) the visible keys are covered up visually.
Note that the tritone is the only key-finding algorithm shown here that ever confuses
a key with a key one tritone away. It is also interesting that the majority of the errors
for the tritone interval guess a minor key.
However, for the major 3rd and minor 2nd experiments (Figs. 5.5(b) and 5.5(c),
respectively), relative major and minor confusions clearly dominate the errors. This
was also predicted from the diffusion time constant plots in Figs. 5.2 and 5.3, and,
in the diffusion maps (Figs. 5.6(b) and 5.6(c), respectively), this same trend can be
seen, where the separation between the dominants (along the circle of fifths that the
structure is built around) is much more significant than the separation between the
relative majors/minors.
The dominant approach is intended to utilize the prevalence of the dominant
cadence as a key-finding cue. Most Western music has a dominant cadence to the
tonic at the end of the piece, and often at the end of phrases or other fundamental
segments as well. This approach certainly has its potential flaws as well, particularly
in that, not only do parallel major and minor keys have the same roots for the
dominant cadence, but the perfect 5th interval can occur in numerous contexts and
is not necessarily key-specific. The hope, though, is that the profile of perfect 5th
intervals will indicate the proper key. And, it is clear in Table 5.1 that this is a
relatively effective approach, yielding comparable results to the best rare-interval
results. And, as expected, many of the errors are relative major and minor confusions,
as seen in Fig. 5.5(d). Interestingly, the map (Fig. 5.6(d)) would lead us to believe
that the common error would be between relative majors and minors, rather than
parallel, but the confusions show this not to be the case. The parallel major/minor
confusion is likely represented in a lower (though clearly still significant) dimension
of the map.
One noteworthy observation in these four approaches is that their diffusion maps
Method                   Accuracy
K-S                      93.75%
K-S + diffusion filter   98.96%

Table 5.2: Accuracy for the K-S key-finding algorithm before and after processing the data with a filter derived from hierarchical clustering in the diffusion space.
(Figs. 5.6(a)-5.6(d)) show different orientations of the keys, particularly with regard to
the connection between major and minor keys. While they are all fundamentally built
on two circles of fifths, those circles are not positioned in the same way in every case
(for example, in Fig. 5.6(b) the minor circle is within the major circle, while, in Fig.
5.6(c) they are stacked on top of each other). This suggests that the approaches may
be complementary, which is also reinforced by the different types of error encountered.
So, we would expect that combining them would yield good improvement, and this
turns out to be the case. Even though none of these methods achieves greater than
77% accuracy, combining them all gives an accuracy of 91.30%. The majority of the
errors are once again relative major and minor mistakes.
Finally, the best results are seen from the full code set of all harmonic combinations
of notes. Even without temporal progressions, a full count of all note combinations
correctly identifies 93.91% of the keys, and once again, the majority of the errors are
confusions between the relative major and minor keys.
5.2.3 Extending the K-S Algorithm with Clustering
It is also possible to use distributional inputs for diffusion-based key finding. One
simple example is to use the pitch-class distributions used in the K-S key-finding
algorithm. Clustering the Bach database (this time without transposing into all
keys) into a hierarchical tree from the diffusion time constant yields the tree in Fig.
5.7. In this plot, the three incorrectly grouped works are circled in red (this equates
to an accuracy of 96.88%). These errors are Prelude No. 10 in E minor (BWV 855)
from Book 1 and Prelude No. 2 in C minor (BWV 871) and Prelude No. 11 in F
minor (BWV 881) from Book 2.
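Such a tree could be built, for example, with SciPy's agglomerative clustering, treating τ directly as a pairwise distance. The average-linkage choice below is an assumption, not necessarily the dissertation's:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def cluster_by_tau(tau, n_clusters):
    """Agglomerative tree from a symmetric diffusion time constant matrix,
    treating tau directly as a pairwise distance."""
    tau = np.array(tau, dtype=float)      # copy before modifying
    np.fill_diagonal(tau, 0.0)            # a piece is zero distance from itself
    tree = linkage(squareform(tau), method='average')
    return tree, fcluster(tree, t=n_clusters, criterion='maxclust')
```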
Figure 5.7: The hierarchical tree created from pitch-class distributions, labeled with key, with errors circled in red.
If the K-S key-finding algorithm is applied directly to these same pitch-class dis-
tributions, it correctly labels 93.75% of the pieces, incorrectly labeling Fugue No. 3
in C♯ major (BWV 848) and Prelude No. 21 in B♭ (BWV 866) from Book 1 and
Prelude and Fugue No. 15 in G major (BWV 884), Fugue No. 19 in A major (BWV
888), and Prelude No. 23 in B major (BWV 892) from Book 2 (6 total mistakes).
However, these mistakes are completely different from those made in the diffusion
clustering, and, as it turns out, are a different type of mistake.
Diffusion Cluster Filtering
We can extend the K-S key-finding algorithm by inserting the hierarchical clustering shown in Fig. 5.7 as a pre-processing step. We can take advantage of the unsupervised grouping that has been performed in the diffusion space by filtering each object together with those grouped near it, essentially making objects deemed to be similar even more similar.
To create the filter for this, we first define the distance between two works in the
tree as simply the number of branches between them, or the number of steps in the
tree it takes to move from one data point to another. A filter can then be defined
from this distance. In this case, we will use a simple filter that only filters objects
that are within 3 steps of each other, and gives an extra weight to each step closer
than 3 that the two objects are. This matrix then needs to be normalized for filtering.
f(x_m, x_n) = \frac{i(x_m, x_n)}{\sum_{p=0}^{K-1} i(x_m, x_p)}, \quad \text{where } i(x_m, x_n) = \begin{cases} 3 & \text{0 steps between } x_m \text{ and } x_n \\ 2 & \text{2 steps between } x_m \text{ and } x_n \\ 1 & \text{3 steps between } x_m \text{ and } x_n \end{cases}
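This filter can be sketched directly from a matrix of pairwise tree-step counts; the function names are illustrative:

```python
import numpy as np

def filter_weights(steps):
    """Normalized filter from pairwise tree distances (steps[m][n] = number
    of branches between works m and n), using the piecewise weights
    i = 3, 2, 1 for 0, 2, and 3 steps and 0 otherwise."""
    steps = np.asarray(steps)
    i = np.zeros(steps.shape, dtype=float)
    i[steps == 0] = 3.0
    i[steps == 2] = 2.0
    i[steps == 3] = 1.0
    # Normalize each row so the filter acts as a weighted average.
    return i / i.sum(axis=1, keepdims=True)

def apply_filter(f, features):
    """Replace each work's pitch-class distribution with a weighted average
    of its tree neighbors' distributions before running K-S key finding."""
    return f @ features
```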
By inserting this filter in front of the K-S key-finding algorithm, the accuracy
improves to 98.96%, only erring in the label of Fugue No. 15 in G major (BWV 884)
from Book 2. The two performances can be seen side-by-side in Table 5.2.
Method                    Accuracy
Euclidean                 89.86%
Diffusion time constant   91.30%

Table 5.3: Accuracy for the meter induction task using nearest neighbors with both Euclidean distance and the diffusion time constant.
5.3 Meter Induction
Meter is another high-level musical attribute that humans extract effortlessly but computers struggle with. As with key, algorithms that determine meter are an important step towards a full computational understanding of music.
Method
Most approaches for computational meter induction use a feature set derived from
autocorrelation of some representation of the musical signal. Sometimes these repre-
sentations incorporate high-level information such as melodic contour or perceptual
accent.
Here, we will use one of the simplest representations, which is to only count
onsets. To extract this vector from a musical score, simply count how many notes
have their onsets at any given rhythmic unit. This is collected into a time series,
which we then input into an autocorrelation function. One small extension is then made: the coefficients of the autocorrelation function at lags that are multiples of 2, 3, 4, etc. are summed together. These sums are collected into one vector which is used as the input. This extension is simply added to encourage the grouping of duple and triple meters.
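A sketch of this feature extraction, with the mean-centering and lag-0 normalization added as illustrative choices:

```python
import numpy as np

def meter_features(onset_counts, max_period=9):
    """Meter-induction features: autocorrelate the onset-count series, then
    sum the coefficients at multiples of each candidate period so that
    related duple (or triple) periodicities reinforce each other."""
    x = np.asarray(onset_counts, dtype=float)
    x = x - x.mean()                                   # center before autocorrelation
    ac = np.correlate(x, x, mode='full')[len(x) - 1:]  # lags 0 .. len-1
    ac = ac / ac[0]                                    # normalize by lag-0 energy
    feats = []
    for period in range(2, max_period + 1):
        lags = np.arange(period, len(ac), period)      # period, 2*period, ...
        feats.append(ac[lags].sum())
    return np.array(feats)
```

For a strongly triple onset pattern, the summed coefficient at period 3 dominates the one at period 2, which is exactly the grouping behavior the extension is meant to encourage.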
The onset-based vectors were calculated for 686 melodies of Germanic origin from
the Essen folksong database [65]. The classification was then performed with a k-
nearest neighbors algorithm in both Euclidean and diffusion space. The training set
was randomly selected as 80% of the data, and the remaining data was used for
testing.
Figure 5.8: The first three dimensions of the diffusion map for meter classification on the Essen folksong database, colored by meter label.
Results
The classification accuracy for both the diffusion space and Euclidean space can be seen in Table 5.3. Though the nearest neighbors algorithm performs quite well in Euclidean space, the move to diffusion space improves the accuracy by about 1.5 percentage points.
However, beyond this improvement, diffusion offers the advantage of visualization.
The first three dimensions of the diffusion map are shown in Fig. 5.8 with data
points colored by meter. In the diffusion space, the folksongs with duple and triple meter are organized approximately into their own planes. The duple plane and triple plane meet, forming a V-shaped organization. However, at the joining point, there is mixing between the two data sets, indicating the separation for those melodies is not as clear.
Fig. 5.9 shows the same plot, but resized according to errors. The training data
is essentially gone from this plot. Test data that was labeled correctly is plotted
with smaller dots and mistaken labels are plotted with large dots. As we would
expect from Fig. 5.8, the errors are concentrated along the spine of the V where the
Figure 5.9: The same as Fig. 5.8, with the test data indicated by larger size, and errors in labeling of the test data largest.
duple and triple meter planes meet. We could potentially predict the reliability of a
classification based on its proximity to this spine.
So, in addition to a small improvement to the results, moving the classification to
diffusion space also offers the ability to visualize the classification itself. Based on the
location of a data point with an unknown label in this space, it can be determined
how certain the classification is, and, on a higher level, how strong the metric identity
of the music itself is.
5.4 Visualization of Trajectories
Creating a visualization of a musical excerpt can provide an interesting artistic tool
for experiencing music in a multimedia environment. A visual analog can provide
a different type of insight into a musical piece, and it can also create intuitive and
natural representations of musical concepts without requiring a rigorous background
in music theory. In this way, a good visualization can serve as an analytic tool, an
artistic extension, and a didactic mechanism, all at the same time.
Diffusion mapping is ideal for creating musical trajectories for several reasons.
First of all, as has been shown repeatedly, the dimensions of the diffusion map hi-
erarchically represent the global structure of a data set. This allows for automatic
creation of the trajectory. Also, any affinity function can be used to relate sections of
the music. So, the flexibility of the diffusion process gives the opportunity to design a
system for plotting desired characteristics. The fact that the diffusion space in which
the trajectory exists is created specifically for the data is advantageous as well, since
this means there is no need for some universal space that works less well for some
excerpts than for others.
To create the trajectories that follow, a very simple representation was used. A
musical score is broken into small, overlapping windows, only one or two beats in
duration. The affinity between these windows is defined exclusively by how many
notes they share in common at the same relative time.
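A minimal sketch of this affinity, assuming a hypothetical window encoding as a set of (beat offset, pitch) pairs (the dissertation's actual data representation may differ):

```python
def window_affinity(w1, w2):
    """Affinity between two score windows, defined (as in the text) as the
    number of note events the windows share at the same relative time.
    Each window is a set of (beat_offset, pitch_name) pairs -- an encoding
    chosen for this sketch, not taken from the dissertation."""
    return len(w1 & w2)

# Two overlapping one-beat windows from a C major arpeggio:
w1 = {(0.0, "C4"), (0.5, "E4")}
w2 = {(0.0, "C4"), (0.5, "G4")}
affinity = window_affinity(w1, w2)   # the windows share only the C4 at offset 0.0
```

Windows drawn from the same chord share many such events and receive a high affinity, while harmonically unrelated windows share few or none.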
This approach creates an interesting diffusion space, where the trajectory of the
music is the movement of the excerpt within that space. Different regions will cor-
respond to harmonic distinctions defined by the interval combinations in the music,
because chords that share notes will also share a relatively high affinity. Also, the
circular connectivity of inversions and variations on chords reinforces this closeness
in diffusion space. These harmonic regions will be related to each other by harmonic
pathways created by the presence of intermediate note combinations between those
regions (such as a G major chord and an E minor chord being connected by their shared notes {G, B}).
However, the time window will also connect harmonic regions that are located
temporally close to each other in the music. As the music progresses from one chord
or note to another, the sliding windows that cover that transition create steps between
those two states. This essentially creates a pathway for the diffusion process to travel
between the two states, and the more pathways that are created, the closer the regions
will be in the diffusion space.
So, the diffusion space will have harmonic regions that are drawn close to each
other by harmonic and temporal commonalities. The collection of these pathways
determines the organization.
Figure 5.10: The melody of Twinkle, Twinkle, Little Star
All of the trajectories to be shown can also be seen as animations online [68].
This is highly recommended, because the addition of temporal movement visualizes
elements like harmonic progression and rhythm. Also, watching the animations in
sync with the music makes it much easier to understand the harmonic layout and the
regions of the diffusion space.
5.4.1 Twinkle, Twinkle, Little Star
Fig. 5.11 shows the trajectory created in the first three dimensions of the diffusion
space for the melody Twinkle, Twinkle, Little Star (Fig. 5.10). This melody is mono-
phonic, and so the orientation of the space is completely determined by the temporal
relationships. That makes this trajectory a particularly good initial example, since
the entire organization is the product of only one characteristic: the melody itself.
To help see how the temporal pathways shape the harmonic regional distribution, the
notes in the melody are marked in the trajectory.
It is very easy to see in this picture how the temporal connections form a pathway
between the notes, and how these connections create the organization in the diffusion
space. Moving along the exterior of the circle in a clockwise direction follows the con-
nections of the first four measures of the melody. These pathways show particularly
clearly how the organization is affected by the connectivity, because the A is only
connected to the G, and as a result is positioned very close to it. The A shares no
direct temporal relationship in this particular melody with any note but G, and this is
clearly reflected in the trajectory. The circular shape of the first four measures also
shows that the phrase starts and ends in the same state (the tonic).

Figure 5.11: The trajectory for Twinkle, Twinkle, Little Star with the individual notes marked.

The pathway connecting the D and G is a product of the movement from measure
6 to 7. This is the only interval in the entire melody that is not stated in the first
four measures. As a result, the D and G are pulled closer to each other. This is
reflected in the vertical dimension. The pathway between the D and G is the spine
of a saddle shape that the exterior circle creates, with the two ends (C on one side
and F and G on the other) dipping downward. So, in this way, D and G are pulled
together while still maintaining the distance between the notes that do not share a
direct connection.
Through these connections, a shape is created that represents the melodic move-
ments of Twinkle, Twinkle, Little Star and the corresponding harmonic implications.
However, it is important to understand that, because no preconditions of music theory
were imposed on the diffusion mapping, the visualization is designed entirely based
on the music. Purely in the context of this melody, C and E have no relationship
except through D. A has no relationship to any other note except through G (and, in
the case of E, through multiple notes). Diffusion mapping takes these relationships
and maps them appropriately in a low-dimensional space.
5.4.2 Prelude No. 1 in C major (BWV 846) from The Well-
Tempered Clavier, Book 1
Bach's Prelude No. 1 in C major (BWV 846) from The Well-Tempered Clavier,
Book 1 (Fig. 5.12), with its arpeggiated character, more sophisticated underlying
harmonic progression, and significantly longer duration, is far more elaborate than
Twinkle, Twinkle, Little Star, and as a result its trajectory, seen in Fig. 5.13, is
much more complex.
One of the most pronounced characteristics of this trajectory is the sections that
jump away from the most common region. In Fig. 5.13, the lower left portion is the
area of most common activity, with the paths leading out into the upper right as the
unusual projectiles. These sections correspond approximately to parts of measures
14-17 and 20-24. Looking at the score, these are sections of tonal drift and harmonic
unrest. In the first section (14-18), the unrest is set off by a diminished seventh in
measure 14, and it is followed by several progressions that do not strongly pronounce
the key or push harmonic movement forward (an inverted tonic, inverted IV7, and ii7).

Figure 5.12: Score of Bach's Prelude No. 1 in C major (BWV 846) from The Well-Tempered Clavier, Book 1.

Figure 5.13: The trajectory for Bach's Prelude No. 1 in C major (BWV 846) from The Well-Tempered Clavier, Book 1.
The original key is not restored until the dominant-tonic statement in measures 18-19.
A similar tonic drift occurs in measures 20-24, in which the subdominant is tonicized
(m. 20-21), followed by a movement toward the dominant. Once the dominant is
restored in measure 24, the trajectory has returned to the lower left region of Fig.
5.13.
Another noteworthy characteristic of the piece that is visually communicated in
the trajectory is the cyclic nature of the tonal progressions. With few exceptions,
every measure consists of an arpeggiated chord repeated twice (with two smaller rep-
etitions of the final three notes within each repetition). These cycles are represented
in the trajectory by circular paths.
On a general level, the trajectory is essentially built of relatively straight paths
with junction points where the path turns sharply. At each of these junction points,
there is a thick joint that, when viewed from the appropriate angle, is seen to be
circular. Several of these circular junctions can be seen in Fig. 5.13, with the most
clear in the center of the plot in orange or near the upper right in green.
These circular junctions communicate the cyclic nature of the tonal progressions
in the prelude. Also, the scale of the circles is accurate to the musical experience.
This piece is not one large cyclical movement (unlike the opening phrase of Twinkle,
Twinkle, Little Star). Instead, it is built from numerous atomic cycles that create
the harmony of the piece. This element is precisely represented in the trajectory.
To give a clearer picture of these circular junctions, the section of the trajectory
that corresponds to the first four measures is shown on its own in Fig. 5.14. Here,
the greater shape is a closed triangle, because the first four measures happen to begin
and end in the tonic. However, the elbows of the triangle can clearly be seen here to
have a circular shape. Each of these circles represents a chord, and they then group
together by their associations to form the tonal regions. Here, the corners of the
triangle correspond to the tonic, dominant, and subdominant tonal regions. As we
already showed in Fig. 5.13, these harmonic regions all align together, separate from
the regions of tonal drift already discussed.
Figure 5.14: The trajectory for Bach's Prelude No. 1 in C major (BWV 846) from The Well-Tempered Clavier, Book 1, with only the first four measures shown.
5.4.3 Robustness to Performance Noise
The connectivity-based organization approach that diffusion utilizes is highly robust
to noise and distortion. This is because, while low-level noise may have a significant
effect on the specific distances between data points, it takes a great deal more noise
to significantly affect the orientation of the local geometry, and even more to affect
the global geometry. So, in the diffusion space, the dimensions corresponding to
larger eigenvalues (and therefore representing the global structure) should be largely
unaffected by reasonable amounts of noise. Instead, the noise is restricted to the
dimensions with smaller eigenvalues, since the noise is a largely local effect. Of
course, everything has a breaking point, but the hierarchically structured nature of
the diffusion space makes the process more robust to noise.
To demonstrate this, performance-like noise was synthetically added to Bach’s
Prelude No. 1 in C major (BWV 846) from The Well-Tempered Clavier, Book 1.
The timing of the note onsets was varied by an amount determined by a sinusoidal
oscillation with random noise added in. Furthermore, occasional random notes were
added in, to simulate mistakes.
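The timing-distortion part of this noise model can be sketched as follows. The oscillation rate and the amplitude of the random component are illustrative values chosen for this sketch; the dissertation does not specify them.

```python
import math, random

def jitter_onsets(onsets, depth, rate=0.5, noise=0.02, seed=0):
    """Perturb note-onset times with a sinusoidal oscillation plus random
    noise, in the spirit of the robustness experiment.  `depth` is the peak
    sinusoidal timing deviation in seconds (e.g. 0.05, 0.1, 0.2); `rate`
    (Hz) and `noise` (seconds) are assumed values, not taken from the text."""
    rng = random.Random(seed)
    return [t + depth * math.sin(2 * math.pi * rate * t)
              + rng.uniform(-noise, noise)
            for t in onsets]

onsets = [i * 0.25 for i in range(16)]        # a bar of even sixteenth notes
low  = jitter_onsets(onsets, depth=0.05)      # mild distortion
high = jitter_onsets(onsets, depth=0.20)      # severe distortion
```

Errant notes would be simulated separately by inserting occasional random pitches into the note list.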
Fig. 5.15 shows the trajectories for these signals, with several levels of distortion.
The lowest level (Fig. 5.15(a)) distorts the timing by ±0.05 seconds and adds in five
errant notes. With this minor noise included, the trajectory still looks identical to
the undistorted trajectory (Fig. 5.13). Increasing the distortion to ±0.1 seconds and
10 errant notes, the trajectory (Fig. 5.15(b)) still looks very similar. Even at the
highest level of distortion, with the timing adjusted by ±0.2 seconds and 20 mistakes
added, the trajectory is still recognizably similar, although at this point the effects
are shown by the instability of the pathways and small warbles in the trajectory.
However, considering the level of distortion in the music at this point, this is still an
impressive demonstration of robustness.
It is also worth noting that, in all of the cases, the harmonic regional layout is
still unaffected, because, even with all the distortion, the harmonicity of the music
has not changed.
The growing instability of the trajectory and variations in the timing information
are more salient in the animations available online [68], and the consistency of the
regional layout despite these distortions is clearly visible as well.

Figure 5.15: Several trajectories for Bach's Prelude No. 1 in C major (BWV 846) from The Well-Tempered Clavier, Book 1, with different levels of performance-like noise synthetically added: (a) low noise, (b) mid-level noise, (c) high noise.
5.4.4 Audio Signals
So far in this work, only symbolic musical representations have been used for the
demonstrations and experiments. However, the abilities of diffusion extend into the
domain of audio signals as well.
To demonstrate this, we will recreate the trajectory for Twinkle, Twinkle, Little
Star (Fig. 5.11) from an audio signal of the melody, played by a piano. However, in
order to accomplish this, a second level of diffusion needs to be added, creating what
we will call a diffusion super graph. In this case, the process breaks the harmonic and
temporal relationships into two separate levels.
The first layer of diffusion takes the Discrete Fourier Transform (DFT) of a win-
dowed segment, 100ms long. Only the magnitude of these Fourier coefficients is used
as the features for the window. The time slices can then be organized with a diffu-
sion map based on the affinity of the Fourier coefficients, which will group individual
notes (or, for a more musically rich signal, chords) together into their harmonic re-
gions. However, unlike in the symbolic trajectories, in this case only the harmonic
relationships will create this space, because the temporal progression is ignored at
this level. But, the physics of resonating sound, with multiple harmonic frequency
components in a single sound source, does play an important role at this level. The
affinity between two frames will be determined in large part by the number of har-
monics they share in common. So, obviously two slices of the same note will have a
high affinity. But, important relationships like the perfect 5th will also have a relevant
role here.
The second layer of the graph tracks the movement of the signal within this space
and breaks this movement down into small sections, several frames long. The distance
between two such sections (which can be derived from the diffusion distance, the diffusion
time constant, or the Euclidean distance in a subspace of the full diffusion space) then
defines the affinity of the second layer of the super graph. Theoretically, additional
layers can be added to the graph indefinitely, if desired.
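A minimal sketch of the first layer's features and affinity follows. The plain O(N²) DFT and the Gaussian kernel are assumptions made for this sketch (a real system would use an FFT, and the dissertation does not commit to this exact kernel here); the example only illustrates why two frames of the same pitch receive a high affinity while frames of different pitches do not.

```python
import cmath, math

def dft_magnitudes(frame):
    """Magnitudes of the DFT of one windowed segment -- the features used
    by the first layer of the diffusion super graph.  A plain O(N^2) DFT
    is used here for clarity."""
    N = len(frame)
    return [abs(sum(frame[n] * cmath.exp(-2j * math.pi * k * n / N)
                    for n in range(N)))
            for k in range(N // 2)]

def spectral_affinity(a, b):
    """A Gaussian affinity on magnitude spectra (one common choice,
    assumed for this sketch)."""
    d2 = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-d2)

N = 64
tone_a = [math.sin(2 * math.pi * 5 * n / N) for n in range(N)]        # one pitch
tone_b = [math.sin(2 * math.pi * 5 * n / N + 0.3) for n in range(N)]  # same pitch, new phase
tone_c = [math.sin(2 * math.pi * 9 * n / N) for n in range(N)]        # different pitch
same = spectral_affinity(dft_magnitudes(tone_a), dft_magnitudes(tone_b))
diff = spectral_affinity(dft_magnitudes(tone_a), dft_magnitudes(tone_c))
```

Because only magnitudes are kept, the phase shift between the first two tones is invisible and their affinity is essentially 1, while the third tone shares no spectral peak with them and its affinity collapses toward 0.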
Figure 5.16: The trajectory for Twinkle, Twinkle, Little Star, derived here from an audio signal instead of symbolic music (as was the case in Fig. 5.11).
By following this two-layered approach, the trajectory seen in Fig. 5.16 was
created from the audio signal. A quick comparison shows the layout to be quite
similar to the trajectory derived from the symbolic data in Fig. 5.11. The same
circular exterior is seen, and the spine across the top of the overall saddle shape is
present as well. The most noteworthy difference is that the small loop that connects
G and A (in the upper right of the symbolic trajectory) is stretched out significantly
further (comparatively to the rest of the structure). This is likely because of the
largely unconnected harmonic sets for these two notes. So, while the two notes are
pulled together temporally, G feels a stronger harmonic connection to other notes,
most particularly C and D. D and A have a harmonic relationship as well, but their
lack of temporal proximity appears to override this connection in the organization.
The success of this experiment makes two important points. First, the diffusion
analysis in this work is not specifically restricted to symbolic music and can be ex-
tended to audio music as well, an important feature for general applicability to many
music applications. Second, this suggests that the diffusion analysis is in fact ex-
tracting deeper musical information than simply what is available at face value. The
audio signal and symbolic data share no superficial characteristics in common. They
are completely different representations of the exact same musical sequence, but the
only way that similarity can be seen is through an understanding of that underly-
ing musical structure. Diffusion mapping creates a space in which these different
representations map to that musical structure consistently.
5.5 Conclusion
This chapter presents a wide range of applications for diffusion mapping in music
analysis. In key analysis, diffusion space was used for functional key analysis, and
the organizations created with structural input improved the performance of the K-S
key-finding algorithm through a cluster-based filter built in the diffusion space. In
meter induction, not only does the diffusion-based approach improve on the same
algorithm in Euclidean-space, but the regions of confusion can easily be seen in the
diffusion space.
Trajectories in diffusion space are a particularly interesting application. The visu-
alizations created were shown as means for analyzing and understanding the music.
The visualizations project characteristics of the music into an intuitive representation
that communicate the musical structure without requiring an actual understanding
of the theory behind the structure. Circular structures, big and small, communicate
cyclical sections of music. Tonal drift is shown through regional drift. In this way,
these trajectories are not unlike the geometric representations shown in Chapter 4,
in that they give an intuitive visual representation of high-level musical concepts.
Combining this ability with the creation of the same trajectory from symbolic
and audio representations of a melody demonstrates clearly that the structures and
analyses created in diffusion space are based on deeper musical concepts rather
than simply superficial features. This is a powerful conclusion and summarizes the
important role that diffusion analysis can play in musical applications.
Chapter 6
Conclusions and Future Work
6.1 Summary of Contributions
This dissertation has introduced a fundamentally novel approach for music analysis
at multiple levels, and so it is difficult to isolate the contributions. The type of
analysis performed is completely new, particularly the theoretical-level analysis in
Chapter 4. However, there are a few particularly salient developments that are worth
emphasizing here.
6.1.1 Diffusion Time Constant
The diffusion time constant is a new metric for diffusion space, and makes a contribu-
tion to the field of diffusion mapping. It extends the diffusion distance in cases where
selecting an optimal time parameter t is difficult, or where the diffusion distance is
insufficient for simultaneously representing the full hierarchy of the structure.
Because finding the diffusion time constant requires utilizing Newton’s method
for solving the sum of exponentials, it can take a significant amount of computation
to calculate, and therefore is not ideal for every application. In many cases, using the
diffusion distance or selecting a few optimal dimensions of the diffusion map provides
a perfectly sufficient metric for analysis. But, in certain cases, such as automatic
processes where the time parameter t cannot be manually selected, or hierarchical
CHAPTER 6. CONCLUSIONS AND FUTURE WORK 120
clustering where encoding all levels of the structure is desired, the diffusion time
constant provides a valuable extension to diffusion analysis.
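The computation alluded to above — Newton's method applied to a sum of exponentials — can be sketched as follows. The particular function solved here, f(t) = Σₖ cₖ λₖ²ᵗ (the form of the squared diffusion distance at diffusion time t), and the half-decay stopping criterion are illustrative assumptions for this sketch; the dissertation's exact definition of the diffusion time constant is given in its Chapter 3 and is not reproduced here.

```python
import math

def time_constant(lams, coefs, frac=0.5, t0=0.0, iters=50):
    """Newton's method on the sum of exponentials
        f(t) = sum_k c_k * lam_k**(2*t),
    returning the time t at which f has decayed to `frac` of its t=0 value.
    `lams` are eigenvalues in (0, 1); `coefs` are positive weights."""
    target = frac * sum(coefs)
    t = t0
    for _ in range(iters):
        f  = sum(c * lam ** (2 * t) for c, lam in zip(coefs, lams))
        fp = sum(c * lam ** (2 * t) * 2 * math.log(lam)   # derivative df/dt
                 for c, lam in zip(coefs, lams))
        step = (f - target) / fp
        t -= step
        if abs(step) < 1e-12:
            break
    return t

# Single-mode sanity check: lam**(2t) = 1/2  =>  t = ln(1/2) / (2 ln lam)
t_half = time_constant([0.8], [1.0])
```

Since f is convex and decreasing for eigenvalues in (0, 1), Newton's iteration from t = 0 converges monotonically to the root, which is what makes this computation reliable if somewhat expensive.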
6.1.2 Assumption-Free Music Analysis
All analyses performed in this work were accomplished with as few assumptions as
possible. In this way, we were actually able to derive music theory, rather than just
confirm that the music fits a set of preconceived notions. This type of analysis, partic-
ularly on the interval and meter level, is completely new and unique. Computational
music analysis has previously sought many elements of music like phrasing and theme,
but the approach presented here of building a computational understanding of music
theory from the ground up is novel.
The lack of assumptions is particularly significant, because previous computational
approaches often tried to incorporate musical knowledge, rather than exclude it. In
some ways, the inclusion of prior knowledge can be valuable, but the assumption-free
analysis shown here offers a new perspective on music. First of all, it allows for the
truly individual analysis of a single musical work. The trajectories shown in Section
5.4 are unique in this way, because not only is their movement through the diffusion
space based on the music, but the space itself is based only on the music. As a result,
the diffusion space is catered specifically to the relationships created in the music,
rather than trying to adapt the representation to fit a preconceived musical space.
On a higher level, the use of assumption-free analysis also demonstrates how
fundamental some of the mathematical relationships of music are. The diffusion
approach presented here does not even truly utilize the knowledge that the data is
music. Instead, it is only a series of objects connected by relationships defined by the
“data.” But, still many significant representations of music theory are created at a
fundamental level. This is a powerful realization about the structural importance of
music theory in the musical data. And, it can only be truly demonstrated by starting
without any assumptions at all.
6.1.3 Musical Visualizations
One of the most exciting and interesting contributions of this work is an easy and
conceptually reasonable process for creating musical trajectories for visualization.
As will be discussed in the Future Work section, there are many ways to extend
these visualizations for the future, but the work accomplished here is already quite
significant and novel.
By breaking symbolic music into a time series of sliding windows, the piece is
visualized as a pathway through a diffusion space created specifically for the musical
relationships. One easy application of this visualization would be for a multime-
dia musical experience, where the visual pathway is directly related to the auditory
experience. But, because the regions within the space are organized based on the
relationships within the music, the movement of the pathway within the space is
meaningful on an analytic level as well. Harmonic regions that are commonly asso-
ciated within an excerpt will be brought closer together while those that are only
rarely connected will be pushed further apart. In this way, a long distance
travelled by the pathway could relate to the perceptual cost of a rare harmonic shift.
The type of movement can be informative as well, representing temporal patterns.
Music has previously been visualized as pathways through a space, but the assumption-
free and automatic process that diffusion allows is a unique and interesting contribu-
tion to computational music analysis.
6.2 Future Work and Extensions
The work presented in this dissertation inspires the possibility of many future projects.
Through both improvements and extensions, there are many avenues to progress from
here.
6.2.1 Audio Signal-based Analysis
The most obvious extension is to perform the musical analyses presented in this
work on musical signals rather than symbolic music. The feasibility for this was
demonstrated for one trajectory in Section 5.4.4, but a more thorough examination
would be beneficial.
A particularly interesting effect of the move to the signal domain is briefly men-
tioned in Section 5.4.4, the effect of the harmonics created by resonant bodies. In
diffusion spaces derived from symbolic music, the spatial relationships are based ex-
clusively on the relationships in the music itself, either temporally or harmonically.
However, acoustic notes resonate at multiple frequencies simultaneously, the funda-
mental and several frequencies at multiples of the fundamental. The relative overlap
of these partials adds an extra relationship for the organization, in addition to the
musical relationships already examined in the symbolic context.
Expanding into the signal domain also allows for the organization of other types
of information, such as instrumentation or timbre. This would add yet another layer
of organization and analysis.
The signal domain also presents the opportunity for other applications for organi-
zation. For example, the first layer of the super graph in Section 5.4.4, in which the
tonal relationships between different spectral coefficients are established, suggests that
diffusion organization could potentially be used for note transcription. Other signal-
level classification tasks such as instrument identification could be solved similarly as
well.
6.2.2 Improved Visualization Platform
The visualizations presented in this work and on the web were all created in Matlab.
This environment is sufficient for the examinations here, but a more visually dynamic
and interactive platform would be an extremely valuable extension. Representing the
data points as dots in a static 2-D projection gives a good starting point, but it is
not hard to envision numerous possible improvements.
One such improvement would be to plot the trajectories as real trajectories instead
of a series of dots. The pathways would be more clearly marked in this context, and
it would more accurately represent the flow of the music, rather than the discretized
concept that the dots present. Identifying clear structures in the diffusion space, such
as cones or planes, and plotting them as these shapes rather than just the data points,
could be a valuable extension for making the visualizations more accessible as well.
The viewing perspective could also be changed in several ways. Interactive control
of the angle during an animation in particular would allow for the structures to be
more easily seen. An even more interesting possibility is for the user to move the
perspective within the diffusion space, creating the concept of flying through the
data. And, in the trajectory case, the perspective could even ride the trajectory
through time, as if on a roller coaster.
These extensions to the visualization would be a direct way to improve the us-
ability of diffusion-based music analysis and enhance the user experience of systems
derived from the analysis. As shown in this work, there is a great deal of poten-
tial for diffusion mapping as a musical visualization tool, for exploration, education,
and artistic experience, but the potential would be vastly enhanced with a visually
dynamic and fluidly interactive platform.
6.2.3 Comparison of Diffusion Spaces
Throughout this work, diffusion spaces are created from small sets of musical data.
Because the diffusion analysis is assumption-free, these spaces are based entirely on
their musical input, and so they are unique to those musical sets. Unfortunately, one
drawback to the musical representations existing in different spaces is that they cannot
be directly compared to each other. Because the spaces have different meanings and
orientations, they cannot simply be treated as the same space for comparison.
So, it would be a significant extension to develop a mechanism for comparing these
unique diffusion spaces and the trajectories or structures within them. This would be
valuable for several reasons. First, it would allow for a deeper understanding of the
diffusion spaces themselves, since they could be contextualized with each other. The
truly unique characteristics of a certain musical space could be more readily identified
based on its comparison to other spaces. This would also provide an interesting
metric for comparing the works themselves, possibly even extending to a database
organization.
The ability to compare diffusion spaces would also open up the possibility for
deeper uses of the super graph concept introduced in the context of the audio signal
visualization in Section 5.4.4. Layers of diffusion could be connected through this
extension, creating simultaneous organizations of multiple musical levels. Diffusion
maps could be created for every measure of a musical excerpt, which could then be
organized into a map for the entire work, which could then be compared to other
works to organize a corpus, and so on.
There are several approaches that could work for solving the problem. One way
would be to compare the relative orientations of some set of reference points common
to both graphs (such as the pitch classes). Another would be to look for similar
shapes in the musical trajectories in the diffusion space, an approach that has been
previously used in comparison of images. And there are countless other potential
methods. There are many intriguing possibilities that follow from the development of
a comparison between diffusion spaces, well worth the effort of developing a reliable
means of calculating the relationship.
6.2.4 Implications for Non-Tonal Western or Non-Western
Music
All examples presented in this work used tonal Western music or tonal Western mu-
sic theory for the musical data. However, as has been mentioned many times, the
diffusion methods presented here are completely assumption-free. There was no rea-
son why the examples needed to use only tonal Western music. The data was selected
only for familiarity and for its well-studied theoretical foundation.
It would be very interesting for future work to examine the visualizations and
geometries created by elements of other musical theories. Using data from music
designed around other tonalities or scales, or even from other sets of notes, would be
a very interesting variation.
It has also been suggested here that the geometric visualizations of elements of
music theory provide a more accessible means for understanding the relationships
implied by those elements. It would be very interesting to test this hypothesis on an
unknown musical framework, to see whether the deeper understanding found for Western music was predicated on a prior understanding of the theory or whether the same gains would appear in a more exploratory environment in which the music theory is not yet entirely understood by the user.
6.2.5 Inverting Diffusion Space to Audio
One particularly ambitious and interesting extension would be to develop a means for
inverting data in a diffusion space back into musical data, either symbolic or signal-
based. This would allow for interactive manipulation of music in diffusion space to
create a corresponding alteration of the music. This could even potentially lead to
sonification of non-musical data that had been organized into a diffusion space.
There are two fundamental ways to accomplish this. One is to invert the diffusion process itself (assuming the original data was musical); the other is to develop a sonification process directly from the diffusion space.
The first approach, inversion, is mathematically challenging, because the diffusion process is designed to preserve not the original data but the relationships between the data. The Markov matrix P is easily recreated from the eigenvectors by using the summation in the definition of the eigendecomposition from Eq. (3.8), and the scaling of the Markov matrix from the affinity matrix can be determined from the stationary distribution of the Markov matrix. Unfortunately, the original data cannot be recreated from the affinity matrix for the affinity functions used in this work, because they are rotation invariant. If the orientation of the data is stored separately early in the process, however, perfect inversion becomes possible. The picture grows more complicated once manipulations of the data in diffusion space are included, and so a deeper examination of this process is called for.
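Both observations can be verified numerically. The sketch below assumes a Gaussian affinity as a stand-in for the affinity functions used in this work; the names markov_from_data, reconstruct_markov, and sigma are illustrative. Because the affinity depends only on pairwise distances, rotating the input leaves the Markov matrix unchanged, while the spectral sum in the style of Eq. (3.8) recovers P exactly.

```python
import numpy as np

def markov_from_data(X, sigma=1.0):
    """Row-stochastic Markov matrix P from a Gaussian affinity on X."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / (2.0 * sigma ** 2))     # depends only on pairwise distances
    return W / W.sum(axis=1, keepdims=True)  # normalize rows to sum to 1

def reconstruct_markov(P):
    """Rebuild P from the spectral sum  P = sum_k lam_k psi_k phi_k^T."""
    lam, psi = np.linalg.eig(P)              # right eigenvectors
    phi = np.linalg.inv(psi)                 # rows are the left eigenvectors
    return (psi * lam) @ phi                 # psi diag(lam) psi^{-1}
```

The reconstruction of P is exact; it is only the step from the affinity matrix back to the data coordinates that loses the orientation.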
The second approach, sonification, is completely free-form and can therefore be accomplished in essentially any way. However, the resulting musical output will not necessarily be as meaningful as in the inversion approach, because it is not automatically contextualized by the musical input. That is not to say that this approach is not
useful or potentially interesting (in fact, it is extremely intriguing), but it does sug-
gest that great care would need to be taken in designing an intuitive mapping for the
audio, a need common to most sonification applications.
6.2.6 Examination of Less Prominent Dimensions of Map
The majority of this work focused on the most prominent dimensions of the diffusion map, that is, the dimensions corresponding to the largest eigenvalues. These are the dimensions in which the more global aspects of the structure are visualized. Since the analyses presented here concerned the most significant aspects of structure, it made sense to stay mainly with the prominent dimensions.
A deeper examination of the less prominent dimensions of the diffusion maps is still needed, however, as this is where the more local aspects of the structure can be found. Furthermore, if two data sets are built on the same fundamental structure with small variations between them, the prominent dimensions will show a similar shape (as was the case with the performance-like noise in Section 5.4.3), and the structurally minor differences appear instead in the other dimensions. It would therefore be worth investigating whether these less prominent dimensions can be compared to determine the nature of the minor differences while still recognizing the fundamental similarity. If this sort of analysis could be achieved, it would be a very powerful tool.
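The eigenvalue ranking underlying this distinction can be made concrete with a small helper. The name split_diffusion_map is hypothetical, and the trivial constant eigenvector (eigenvalue 1) is assumed to have been dropped already.

```python
import numpy as np

def split_diffusion_map(eigvals, eigvecs, n_prominent):
    """Partition diffusion coordinates into prominent (global) and less
    prominent (local) dimensions by eigenvalue magnitude. eigvecs has
    one eigenvector per column, matching eigvals."""
    order = np.argsort(-np.abs(eigvals))           # largest eigenvalue first
    coords = eigvecs[:, order] * eigvals[order]    # eigenvalue-weighted coords
    return coords[:, :n_prominent], coords[:, n_prominent:]
```

The first block of coordinates carries the global structure; the second is where the structurally minor differences between closely related data sets would be sought.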
6.2.7 Dual Diffusion
To this point, the diffusion process has involved a series of data points with a mea-
surable affinity between them. This affinity is calculated on some set of features that
represent the data points, and then the data points are organized based on those
affinities.
A dual diffusion approach can instead be used. In this process, the data points
are treated in the exact same way, with affinities calculated based on the features.
However, in the feature space, affinities are also calculated between the features them-
selves. Then, both of these dimensions can be separately organized.
So, in this approach, two organizations can be created. In one, the data points
are organized based on common patterns in the features. In the second and new
organization, the features are organized based on common patterns in the data points.
This second map gives a completely new way of looking at the data.
Combining both of these views offers the opportunity to extract a deeper and more meaningful organization. An organization that describes both dimensions, instead of only one, gives a more fundamental description of the data. Such a method should also be more robust to noise and labeling errors, thanks to the enhanced and complementary understanding gained from the two different dimensions of analysis.
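The dual construction can be sketched by applying one diffusion routine twice: once to the rows of the data matrix (the data points) and once to its columns (the features). The sketch assumes a Gaussian affinity, and the names diffusion_coords and dual_diffusion are illustrative.

```python
import numpy as np

def diffusion_coords(M, sigma=1.0, k=2):
    """Leading non-trivial diffusion coordinates of the rows of M."""
    d2 = ((M[:, None, :] - M[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / (2.0 * sigma ** 2))       # Gaussian affinity between rows
    P = W / W.sum(axis=1, keepdims=True)       # row-stochastic Markov matrix
    lam, vecs = np.linalg.eig(P)
    order = np.argsort(-lam.real)
    idx = order[1:k + 1]                       # skip the trivial constant eigenvector
    return (vecs[:, idx] * lam[idx]).real

def dual_diffusion(X, sigma=1.0, k=2):
    """One map organizes the data points (rows), the other the features (columns)."""
    return diffusion_coords(X, sigma, k), diffusion_coords(X.T, sigma, k)
```

The two returned maps are the two organizations described above: the first clusters data points by their feature patterns, the second clusters features by their patterns across the data points.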
Looking at data from multiple angles, as suggested here, is not specific to diffusion; any organizational method could likely be extended to include this approach. However, diffusion mapping is particularly well suited to the process.
First, this dual diffusion approach addresses an important shortcoming of unsupervised data organization, of which diffusion mapping is an example. The deepest shortcoming of unsupervised organization is that it gives no insight into why an organization exists or what criteria define each cluster; this information typically must be extracted analytically. The dual approach solves this problem with the corresponding organization of the features, which yields a set of clusters of its own that correspond to groups of data points, and therefore offers another means of understanding the meaning of the structure.
Also, the hierarchical nature of diffusion space gives a valuable ranking of the organization, so, when the organizations must be adjusted or filtered to find one common to both the data and the features, the more fundamental levels of the organization can still be preserved.
Finally, the distribution-free nature of diffusion mapping is useful here, because there is no guarantee that the set viewed in terms of the features will follow the same type of distribution as the set viewed in terms of the data points. With diffusion, this is not a concern.
Applying this approach to musical database organization, in particular, could be
extremely innovative and valuable, because it would allow a user to not only extract
(and potentially modify) an organization but also to understand the common traits
in that organization, adding an extra element to the user’s interactivity with the
database. This would be especially ideal for musical discovery.
6.3 Concluding Remarks
This dissertation used diffusion mapping to build fundamental elements of Western music theory from the ground up, without any prior assumptions built into the system. Despite this lack of musical knowledge, the first dimensions of the diffusion map created geometric representations of musical elements that offer another means of understanding the musical relationships in Western music theory.
Extending this concept led to higher-level organizations based on key or meter,
and eventually led to the representation of musical excerpts as trajectories winding
through a unique diffusion space.
The work presented here attempts to provide a thorough foundation for diffusion-
based music analysis. However, it still only represents a beginning. The potential
for the non-linear analysis of music for geometric understanding and exploration is
far too vast for a complete examination in only one dissertation. The future work
suggested here offers many possible directions for the next research steps, though
these are likely only a small sample of the possible directions for diffusion research in
computational music analysis.
This work points to many more analytic and artistic experiments that remain to be done, both for the field of music analysis and for the field of diffusion analysis. The depth of the analysis, combined with the beauty of the visual space, makes this work far too intriguing to end here. Hopefully this work will inspire others to join me in driving diffusion-based music analysis further forward and in expanding our concepts of theoretically grounded music analysis, interactive multimedia musical education, and the interaction of audio and visual representations for a truly unique artistic experience.