Probabilistic Music Style
Modeling and a Computational
Approach for Conceptual
Blending in Music
Mohamed Nour Nader Abouelseoud -
s1716655
Master of Science
Artificial Intelligence
School of Informatics
University of Edinburgh
2018
Abstract
This dissertation describes the design and implementation of a deep learning ar-
chitecture for music style modeling. Two LSTM-based machine learning models are
trained to learn different styles, Bach and traditional folk music, and to generate new
pieces in their respective styles. We then propose a computational model for blending
these two models into a new conceptual blending model that generates music charac-
teristic of both styles. I conduct quantitative analysis of the outputs of the models and
show that they succeed in their respective tasks to varying extents. I also present par-
ticipant survey results showing general audience agreement that the generative music
models succeed in capturing the specified genres and that the blended model almost
perfectly blends the two, generating musical outputs that contain characteristics of
both constituent genres in equal measure.
Acknowledgements
I would like to thank my supervisor Alan Smaill for his advice and patience, Dr. Joe
Corneli for volunteering his time to give me constant feedback, and my friends and
colleagues for agreeing to participate in my study and intently listening to my musical
output. Credit also goes to MIT’s Michael Cuthbert and his Music21 library (2010),
with which all handling, pre-processing, post-processing and graphing of the MIDI
and music files was implemented.
Declaration
I declare that this thesis was composed by myself, that the work contained herein is
my own except where explicitly stated otherwise in the text, and that this work has not
been submitted for any other degree or professional qualification except as specified.
(Mohamed Nour Nader Abouelseoud - s1716655)
To my parents
Table of Contents
1 Introduction 1
1.1 Project Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
2 Background and Literature Review 4
2.1 What is Meant by Algorithmic Composition . . . . . . . . . . . . . . 4
2.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2.1 A Brief History of Algorithmic Composition . . . . . . . . . 5
2.2.2 Narrowing Down the Scope . . . . . . . . . . . . . . . . . . 7
2.3 Markovian and Other Statistical Models . . . . . . . . . . . . . . . . 7
2.3.1 Motivation and Initial Attempts at Using Markov Models . . . 7
2.3.2 Recent Success of Markov Models . . . . . . . . . . . . . . . 8
2.3.3 Shortcomings of Markov Models . . . . . . . . . . . . . . . 9
2.4 Artificial Neural Networks . . . . . . . . . . . . . . . . . . . . . . . 9
2.4.1 Motivation and Definition of ANNs . . . . . . . . . . . . . . 9
2.4.2 Success of ANNs in Modeling Musical Structure . . . . . . . 10
2.5 Conceptual Blending . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.5.1 On Conceptual Blending . . . . . . . . . . . . . . . . . . . . 11
2.5.2 Conceptual Blending in Music . . . . . . . . . . . . . . . . . 12
3 Methods 13
3.1 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.1.1 The Bach Model . . . . . . . . . . . . . . . . . . . . . . . . 13
3.1.2 The Traditional Folk Music Model . . . . . . . . . . . . . . . 14
3.2 Pre-Processing and Data Cleaning . . . . . . . . . . . . . . . . . . . 15
3.3 Data Representation . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.4 Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.4.1 Recurrent Neural Networks and LSTMs . . . . . . . . . . . . 18
3.4.2 Hyper-parameters . . . . . . . . . . . . . . . . . . . . . . . . 19
3.4.3 Sampling the Next Time-step . . . . . . . . . . . . . . . . . 20
3.4.4 Conceptual Blending . . . . . . . . . . . . . . . . . . . . . . 21
4 Experiments 24
4.1 Loss Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
4.2 Training the Models . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
4.3 Intermediate Stages and Experimentation . . . . . . . . . . . . . . . 27
5 Evaluation 29
5.1 On Evaluating Music Composition Models . . . . . . . . . . . . . . . 29
5.2 On Evaluating Conceptual Blending Models . . . . . . . . . . . . . . 30
5.3 Analyzing Musical Outputs Against Training Data . . . . . . . . . . 31
5.4 Empirical Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . 35
6 Conclusion 38
6.1 Context and Motivation . . . . . . . . . . . . . . . . . . . . . . . . . 38
6.2 Summary and Future Work . . . . . . . . . . . . . . . . . . . . . . . 39
A Participant survey 41
Bibliography 44
List of Figures
3.1 Triplet preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.2 Data irregularities . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.3 Different note representations . . . . . . . . . . . . . . . . . . . . . . 18
3.4 Bach model architecture . . . . . . . . . . . . . . . . . . . . . . . . 20
4.1 Model training losses against epochs . . . . . . . . . . . . . . . . . . 26
5.1 Pitch class frequency histograms . . . . . . . . . . . . . . . . . . . . 34
5.2 Pitches and note lengths weighted scatter diagrams . . . . . . . . . . 35
5.3 Participant rating of the style models’ accuracy showing notable spread
of opinions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
5.4 Participant rating of the blending model showing polarization of opinion 37
Chapter 1
Introduction
1.1 Project Overview
This report will discuss the design, development and implementation of an artificial
intelligence music composition system. The project is multifaceted and is split into
more than one part, tackling several problems within the realm of algorithmic com-
position: music style modeling, the relatively recent cognitive theory of conceptual
blending, how to apply that theory computationally, and how it pertains to computa-
tional creativity in general.
The first part of the project is to construct two different generative machine learning
models using artificial neural networks. Each model will learn from a training set in
a different musical style, and their outputs will be evaluated both objectively and sub-
jectively through user surveys. The two models will then be used in the second part of
the project.
The second part examines the cognitive theory of conceptual blending and its rela-
tion to computational creativity and generative artificial intelligence. Using the two
generative models implemented in the first part, different hypotheses for computation-
ally implementing conceptual blending will be demonstrated and examined, and the
effectiveness of each will be compared against the others.
1.2 Organization
The report will start with a preamble containing some commentary on how the term
“algorithmic composition” is used in the literature, in this report, and in relation to this
project. This is followed by a background introducing the different aspects of the
project and a literature review describing previous work in the fields of algorithmic
composition and conceptual blending.
This will be followed by a chapter describing the methodology and the different
approaches by which the models were built. The chapter also describes the key aspects
of the models: the training data sets, the digital representation of the musical data, the
artificial neural network architecture, and the hyper-parameters used. The section on
the data also discusses some anomalies present in the music pieces comprising the
datasets and the data cleaning process.
Building on the methodology, the next chapter describes the experiments and the
training of the models specified earlier, and comments briefly on the training process.
It also discusses the different iterations and design decisions in the evolution of the
final models, shedding light on why certain choices were discarded during the design
and implementation of the project. These iterations include different rules by which
the music is generated, different representations of the music, and other aspects that
changed, sometimes drastically, over the course of the project.
The decision to discuss these intermediate models, and why they were found un-
satisfactory, in the experiments chapter rather than in the methodology was not arbi-
trary. It did not seem fitting to include them in the methodology chapter, as they do
not directly relate to the final implementation and methods that were ultimately scru-
tinized and evaluated. It remained important to mention them, however, and a section
in the experiments chapter seemed most suitable: trying out these different models
and design decisions falls under the category of ‘experimenting’ for the best output of
the model, and they contributed greatly, through trial and error and the consequent
evolution of the model, to the final outcome.
The methodology and experiment details will be followed by an evaluation chapter.
The evaluation of the different models will be done by analyzing the pieces musically
and examining the properties of the different styles, as well as through a subjective
evaluation. The subjective evaluation will consist of user surveys in which participants
listen to the different pieces and rate how successfully each models the given style or,
in the case of the conceptual blending model, how well it blends the two styles into
one novel musical product.
The report will be concluded by a final word giving a better idea of the context
in which this research falls and where the idea originated, then a brief reiteration of
the results to sum up the paper, followed by a future work section suggesting possible
directions for further research.
Chapter 2
Background and Literature Review
2.1 What is Meant by Algorithmic Composition
By way of a much-needed preamble, and before introducing some of the history and
previous work on algorithmic composition, a word needs to be said on the different
uses of the term ‘algorithmic composition’ in the literature and how it will be used in
this paper.
Nierhaus (2009), in his comprehensive book on algorithmic composition, makes
the distinction between “algorithmic composition as a genuine method of composi-
tion,” in which the complete overall structure of the music is determined exclusively
by the algorithms or system in place, and “style imitation,” in which “musical material
is generated according to a given style,” mostly as a means of music analysis and veri-
fication by resynthesis. The latter, as Nierhaus describes it, is similar to the way gram-
mars are used in natural language processing to parse sentences and verify their lin-
guistic soundness by resynthesis using rules and parse trees.
Nierhaus also comments on this vagueness and on the gray areas of the definitions,
and on the unclear line that separates genuine composition from style imitation (as he
defines it). It is consequently often hard to place a given algorithmic composition sys-
tem into one of these two categories, and the system proposed in this paper is no ex-
ception. I would argue, however, that since the machine learning system implemented
and described here generates novel compositions completely autonomously, with all
aspects of the compositions determined exclusively by the system, it falls under the
first definition of the term, despite the fact that it does ‘imitate’ the style on which it
was trained.
The fact that this system, like almost every algorithmic composition system in exis-
tence, even completely autonomous ones, has to derive its rules from a body of works
(which will, inevitably, have a certain style) also sheds light on the problematic
choice of words in the definitions. A music composition system, unless explicitly
designed to generate completely novel pieces of music in a completely new, never-
before-seen ‘style’ or genre (a very strange, difficult goal), will necessarily imitate a
certain style of music, causing confusion when attempting to place the system under
one of the two definitions mentioned here.
It should also be noted that a third, arguably erroneous, use of the term can be
gleaned from some of the literature by music and arts researchers, where “algorithmic
composition” has been used to refer to programs designed to aid composers with as-
pects of the composition process, such as visualizing music or assisting with score
writing. The more suitable term for this is “computer aided composition,” which is
unrelated to the topic at hand.
The following sections provide a historical background and literature review of the
field and of previous work in algorithmic composition, some of which was presented
in the project proposal (2018) but needs to be reiterated for emphasis and clarity, and
to put the project in perspective relative to the field.
2.2 Background
2.2.1 A Brief History of Algorithmic Composition
Since the first, most primitive occurrences of automation, the problem of algorithmic
composition, or using computers to “compose elaborate and scientific pieces of mu-
sic of any degree of complexity or extent” (Bowles, 1970), has been raised. Ada
Lovelace, while working with Charles Babbage on the difference and analytical en-
gines around 1840, hinted at the problem (Herremans et al., 2014). Despite initial
optimism and excitement, however, expectations were soon tempered: the problem
proved far more challenging than first assumed, with computers unable to generate
even simple melodies anywhere near acceptable to humans (Conklin, 2003).
Music composition, like creative tasks in general, is complex and elusive, and to
this day not entirely understood. It consists of “accumulated individual experiences,
cultural contexts, and inherited predilections swirling about within the composer” to
create a work of music (Todd and Werner, 1999), which is not an easy challenge to
approach even with the most advanced artificial intelligence. Over the years, many
machine learning and probabilistic models have been devised to model certain music
styles statistically and achieve results as realistic as possible. Solving this task means
not only solving the titular problem of algorithmic composition, in the sense of gen-
erating an aesthetically pleasing output that successfully captures a certain musical
style while satisfying music theory constraints; a satisfactory solution would also
have many consequent applications useful for music information retrieval systems,
such as identifying composers, automatic music transcription, finding a suitable har-
mony for a given melody, music segmentation and pitch estimation (Raczynski and
Vincent, 2014).
The “musical dice game” attributed to Mozart, though with no proper authentication
of this attribution (Cope and Mayer, 1996), seems to be the earliest described system
of algorithmic composition using a probabilistic model. In the game, the player rolls
two dice and, based on the outcome, chooses 16 measures from a set of measures pre-
composed by the composer; then, based on the roll of a third die, picks another 16
measures from a different set of bars to form the trio to the minuet (Boenn et al.,
2008). Although it does use a statistical model and the numbers from the dice rolls to
randomly ‘generate’ music, the creative work remains that of the composer who pro-
vided the measures, and not of the statistical model.
Following this, and despite the initial interest evidenced by the previous example
and Lovelace’s mid-nineteenth-century remark, the field of algorithmic or automated
composition went through a long hiatus. This was a natural consequence of the fact
that the technology needed to come anywhere close to this ambitious vision of music
generation and analysis by an autonomous machine would not appear for decades.
With the introduction of digital computers in the mid-twentieth century and the suffi-
cient advancement of technology, however, the area was immediately revived and
brought back to the scientific community’s eye, most notably by Hiller Jr and Isaacson
(1957) in their 1957 composition titled the Illiac Suite, in which a Markov chain
model paired with a music composition rule system was employed. Iannis Xenakis is
another notable early contributor who famously used computers and stochastic meth-
ods to generate music, some of which also used Markov models (Ames, 1987). Some
of these methods will be elaborated on later in this section.
2.2.2 Narrowing Down the Scope
While the idea of generating new music using a set of rules and generation probabilities
is not a new development nor is it a whim of the artificial intelligence research commu-
nity, it has undeniably flourished and resurfaced to prominence in the past few decades
due to lower computing costs and the consequent advances in AI research. Many AI
systems, with varying levels of complexity and success were devised to tackle the
problem of algorithmic composition as well as other subproblems of it.
These methods include grammars, rule-based systems, constraint satisfaction mod-
els, Markov chains and statistical models, evolutionary algorithms, cellular automata,
artificial neural networks and other methods from the field. This survey, however,
covers only the machine learning methods used for algorithmic composition, namely
Markov models and artificial neural networks. More exhaustive surveys covering
these topics, and more, are available, such as Fernandez and Vico’s (2013) survey of
the field.
2.3 Markovian and Other Statistical Models
2.3.1 Motivation and Initial Attempts at Using Markov Models
Boenn et al. (2008) summarize the problem of creativity as it pertains to musical com-
position briefly as “where is the next note coming from?” To answer this question,
among the first intuitions was to apply techniques developed for natural language
processing (Nierhaus, 2009) to create generative music models, since both are essen-
tially sequence generation problems. From here began the experimentation with
Markov chains and other Markovian model variations for generating music.
Harry Olson was among the first to experiment with Markov models in musical
applications (Nierhaus, 2009). Olson analyzed 11 pieces, standardized to the key of
D, and produced first- and second-order Markov models for pitch and rhythm (Olson,
1967). Upon examining the results, Olson asserted that the second-order model pro-
duced better melodies and better resolutions in the generated music.
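Olson’s first-order approach can be sketched as counting pitch-to-pitch transitions in a corpus and sampling from the resulting distributions. The toy corpus below is invented for illustration and is not Olson’s data:

```python
import random
from collections import defaultdict

def train_first_order(melodies):
    """Count pitch-to-pitch transitions and normalize to probabilities."""
    counts = defaultdict(lambda: defaultdict(int))
    for melody in melodies:
        for prev, nxt in zip(melody, melody[1:]):
            counts[prev][nxt] += 1
    return {p: {n: c / sum(f.values()) for n, c in f.items()}
            for p, f in counts.items()}

def generate(model, start, length, rng=random):
    """Sample a melody by repeatedly drawing the next pitch."""
    melody = [start]
    for _ in range(length - 1):
        dist = model.get(melody[-1])
        if not dist:  # dead end: pitch never observed with a successor
            break
        pitches, probs = zip(*dist.items())
        melody.append(rng.choices(pitches, weights=probs)[0])
    return melody

# Invented two-melody corpus in D major, for illustration only.
corpus = [["D", "E", "F#", "G", "A", "G", "F#", "E", "D"],
          ["A", "G", "F#", "E", "D", "E", "F#", "D"]]
model = train_first_order(corpus)
tune = generate(model, "D", 8)
```

A second-order model is the same idea with states that are pairs of consecutive pitches, which is what Olson found produced the better melodies.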
Hiller Jr and Isaacson (1957) experimented with algorithmic composition to gen-
erate the Illiac Suite, in which the fourth movement, or “experiment,” as they referred
to it, used variable-order Markov chains to model different aspects of the music, such
as skips and sound textures, to generate the notes.
Along with the aforementioned efforts, another pioneer in computer aided compo-
sition, Iannis Xenakis, was concurrently working with Markov models to generate
music (Nierhaus, 2009). He used Markov chains to generate a transition probability
table of different “screens.” The screens encoded the different dynamics and instru-
mentations to be used at different points in the final music composition.
2.3.2 Recent Success of Markov Models
Among the more recent experiments in music generation using Markov models is that
of Ponsford et al. (1999), who used a corpus of 84 seventeenth-century sarabandes (a
simple dance form in triple time) to generate acceptable new pieces. The experiment
included a preprocessing step to annotate the music with its explicit harmonic struc-
ture, and a post-processing step paired with a set of grammar rules to ensure accept-
able results. It is worth mentioning that the preprocessing step, in which the pieces are
annotated with their harmonies, is almost perfectly analogous to part-of-speech tag-
ging in natural language processing.
Ponsford et al.’s system achieved relatively successful results, with most pieces
starting and ending in the same key, as well as resolving with a perfect or imperfect
cadence, a characteristic shared by all of the corpus pieces used for training. It should
be noted that the evaluation methods used by Ponsford et al. are completely subjective,
depending on human perception and the evaluators’ judgment as to whether the pieces
resemble the training corpus. This is common in generative models of music and in
creative AI systems dealing with similarly subjective tasks. This evaluation problem
will be elaborated on later and addressed in our proposed system.
Another successful implementation was that of Pachet and Roy (2011), who built a
mixed model combining Markov processes with a constraint system. The set of con-
straints addressed an interesting problem: it ensured that, even when the constraints
are applied, the underlying probabilities of the Markov model are still satisfied and
reflected in the results, a quality that models up to that point could not uphold, as they
would usually violate the Markov property upon applying constraints.
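To see why this is nontrivial, consider the naive alternative: sample unconstrained walks from the Markov model and reject those that violate the constraint. This recovers the correct conditional distribution but can require a huge number of tries as sequences grow, while greedy in-place enforcement distorts the probabilities; Pachet and Roy’s contribution is to compile the constraints into a new Markov model that is both distribution-preserving and efficient. A sketch of the naive rejection approach, with an invented transition table over note names (not their technique or data):

```python
import random

# Toy transition probabilities, assumed for illustration only.
P = {
    "C": {"D": 0.5, "E": 0.3, "G": 0.2},
    "D": {"E": 0.6, "C": 0.4},
    "E": {"F": 0.5, "D": 0.3, "C": 0.2},
    "F": {"G": 0.6, "E": 0.4},
    "G": {"C": 0.7, "F": 0.3},
}

def sample_walk(start, length, rng=random):
    """Unconstrained random walk through the Markov chain."""
    seq = [start]
    for _ in range(length - 1):
        notes, probs = zip(*P[seq[-1]].items())
        seq.append(rng.choices(notes, weights=probs)[0])
    return seq

def constrained_sample(start, length, constraint, max_tries=10000, rng=random):
    """Rejection sampling: resample whole walks until one satisfies the
    constraint. Correct but potentially very wasteful, which is exactly
    the inefficiency Pachet and Roy's compilation approach avoids."""
    for _ in range(max_tries):
        seq = sample_walk(start, length, rng)
        if constraint(seq):
            return seq
    return None

# Constraint: the melody must end back on the tonic C.
melody = constrained_sample("C", 8, lambda s: s[-1] == "C")
```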
While the model does generate subjectively agreeable variations, continuations and
answers to given melodies, the paper contains little examination of the output, per-
haps on the assumption that, since it satisfies the constraints, the system at least gen-
erates minimally acceptable results. This lack of evaluation nevertheless remains a
significant shortcoming, especially in a machine learning paper dealing with compu-
tational creativity, where the evaluation, examination and fitness of the output is of
utmost importance.
2.3.3 Shortcomings of Markov Models
While Markov models can achieve favorable results, as shown, and have proven useful
as a tool for probabilistic music style modeling, they have an inherent inability
to capture important temporal features and long-term dependencies present in music,
such as motif repetition and stylistic variations on a theme (Quick, 2016). To tackle
this problem, there has been experimentation with variations on Markov models as
well as combinations of them with other, non-Markovian models.
One notable attempt is the system devised by Martin et al. (2010), which employs
Partially Observable Markov Decision Processes (POMDPs). A POMDP is a general-
ization of the hidden Markov model that takes actions and interactions between the
agent and the environment into consideration when choosing the next state; here it is
used to create a system that improvises alongside a musician. Martin et al. used a
simple evaluation function based on whether the generated improvisation is in the
expected key, and achieved good results. This was an interesting effort, since POMDPs
had not previously been used in this context of algorithmic composition or interactive
music generation. The experiment also marked a certain move outside conventional
Markovian language modeling of music and into other realms. One such realm of
machine learning that has proven very successful utilizes Artificial Neural Networks
(ANNs). These will be discussed in further detail in the next section and will be the
main focus of the project.
2.4 Artificial Neural Networks
2.4.1 Motivation and Definition of ANNs
ANNs are a biologically inspired machine learning model that allows for problem solv-
ing by manipulating and changing the structure of internally connected components
(Nierhaus, 2009) and the weighted sums and connections between them. As
Nierhaus points out, employing neural networks for sequence generation to compose
music allows for surprising models that follow the underlying qualities of the corpus
and, unlike generative grammars or Markov models, do not blindly reproduce transi-
tions observed in the training corpus. This capacity for detecting subtleties has led to
many attempts, some notably successful, to tackle the problem of algorithmic compo-
sition using ANNs.
2.4.2 Success of ANNs in Modeling Musical Structure
One of the first systems to seriously employ artificial neural networks for music com-
position is HARMONET, developed by Hild et al. (1992), which aimed to generate
four-part harmonies for J. S. Bach’s chorale melodies in his style, a popular composi-
tion exercise for music students. HARMONET uses recurrent neural networks
(RNNs), in which the output of one time-step is fed back as an input to the next, to
achieve the task. The network has 106 input units representing the current harmonic
and melodic contexts, a boolean indicating whether the current note falls on a stressed
beat, and the position of the current harmony relative to the beginning or end of the
musical phrase; it was trained on two sets of twenty Bach chorales in major and minor
keys. The output of the recurrent network, denoted in the paper as Ht, represents the
generated harmony at the current time-step, or beat, and is fed back into the network
as the current harmonic context, used along with the other inputs to generate Ht+1.
This is paired with another, simpler neural network, not discussed elaborately in the
original paper, which is responsible for adding ornamentation to the output of the
RNN.
Harmonies generated by HARMONET were judged by “an audience of musical
professionals” to be on the level of a professional musician and suitable for musical
practice (Hild et al., 1992). It therefore bears significant importance for the field as
one of the first experiments using ANNs, and more specifically RNNs, to yield such
successful results.
Recurrent neural networks will be introduced in further detail later, as our music
composition model uses a Long Short-Term Memory (LSTM) network, a type of
recurrent neural network with certain features desirable for the music composition task.
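As a preview of the mechanics discussed later, a single LSTM cell step can be written out directly from its gating equations. The sketch below uses scalar states and arbitrary toy weights (chosen for illustration, not learned values) purely to show how the forget, input and output gates regulate the cell’s long-term memory:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, W):
    """One LSTM time-step with scalar input and state. W maps each gate
    name to (input weight, recurrent weight, bias); toy values only."""
    i = sigmoid(W["i"][0] * x + W["i"][1] * h_prev + W["i"][2])    # input gate
    f = sigmoid(W["f"][0] * x + W["f"][1] * h_prev + W["f"][2])    # forget gate
    o = sigmoid(W["o"][0] * x + W["o"][1] * h_prev + W["o"][2])    # output gate
    g = math.tanh(W["g"][0] * x + W["g"][1] * h_prev + W["g"][2])  # candidate
    c = f * c_prev + i * g   # cell state: gated long-term memory
    h = o * math.tanh(c)     # hidden state: the step's output
    return h, c

# Arbitrary illustrative weights, identical for every gate.
W = {k: (0.5, 0.25, 0.0) for k in "ifog"}
h, c = 0.0, 0.0
for x in [1.0, 0.0, 1.0]:  # a tiny input sequence
    h, c = lstm_step(x, h, c, W)
```

The explicit cell state c, carried forward and only multiplicatively gated, is what lets LSTMs retain the long-term dependencies that plain Markov models miss.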
Adiloglu and Alpaslan (2007) also achieved impressive results with a relatively
simple neural network model. Their “NeuroComposer” system used a simple feed-
forward neural network to generate a first-species counterpoint to a given melody.
What is interesting in the model is the input representation of the music, since repre-
sentation in these models is very important and instrumental to the performance of the
network. NeuroComposer’s representation encoded the pitch height, the location on
the circle of fifths and the location on the chromatic circle. The model was also sub-
jectively evaluated: one evaluation showed that the model’s output generally obeys
the rules of counterpoint, while another, by three professional musicians, found that
the pieces do not consistently obey the rules but that the output is generally correct.
2.5 Conceptual Blending
2.5.1 On Conceptual Blending
Creativity, among many competing definitions, is defined as the process or phenomenon
by which we produce a useful product that is previously unseen (Mumford, 2003).
Boden (2009) divides creativity into three forms by which this novel outcome can be
generated: exploratory creativity, in which new regions of a conceptual space are ex-
amined; transformational creativity, in which established concepts are manipulated in
original, previously unseen ways; and combinational creativity, in which conceptual
spaces that are seemingly not directly related are linked and associations made be-
tween their elements. Boden also asserts that combinational creativity is the most dif-
ficult to describe formally.
Directly related to this is Turner and Fauconnier’s (2008; 2014) cognitive theory of
conceptual blending, in which they describe the merging, or “blending,” of elements or
aspects of two conceptual spaces to create a novel concept which combines aspects of
both of the spaces from which it is blended. This new blended space, besides retaining
some of the qualities of the original spaces, possesses novel properties that are inter-
pretable in a way that allows for better understanding of the constituent spaces or the
emergence of a completely new idea or concept (Kaliakatsos-Papakostas et al., 2014).
Hadjileontiadis (2014) asserts that, despite the relative simplicity of generating
novel products by blending old ones, the difficulty lies in applying this combinational
creativity, or blending of concepts, in a “computationally tractable way, and in being
able to recognize the value of newly invented concepts for better understanding a cer-
tain domain; even without it being specifically sought, i.e., by ‘serendipity.’” This dif-
ficulty makes it challenging to construct a ‘universal’ or generic conceptual blending
model that can be successfully applied to any domain. A further difficulty, arising
from this and from the sometimes vague nature of what a suitable blend between two
spaces might be, is the evaluation of the blending model’s outcome. This will be
elaborated on further in the evaluation section.
Further on conceptual blending, Goguen (2006), unlike preceding proposals, pro-
vides an explicit computational account of blending, along with an interesting mathe-
matical framework for representing space and time that supports conceptual blending
in several “metaphorical” ways. He explores four examples showing different kinds
of ambiguity in the process of conceptual blending: geometric conceptual spaces, a
puzzle that incorporates conceptual blending in its solution, the famous example of
the different possible blends of “boat” and “house” and their implied conceptual
spaces, and temporal reasoning using spatial analogy. The framework provides im-
portant insight into conceptual blending and the related cognitive process.
2.5.2 Conceptual Blending in Music
Although conceptual blending is a relatively new theory, there has been, since its be-
ginning, a lot of interest in it as it relates to computational creativity, especially in
music. Kaliakatsos-Papakostas et al. (2014) tackled the problem of conceptual blend-
ing of harmonic spaces and achieved very satisfactory results. They attempted con-
ceptual blending at the chord level, the harmonic progression level and the harmonic
structure level (at which they created a “blendoid” of two harmonic systems, namely
the symmetric harmonic space and the diatonic space). They also briefly comment on
“meta-level” harmonic blending, in which more metaphorical connections between
different domains are examined, discussing it as it pertains to the emotions and moods
conveyed by the chords.
Similarly, Cambouropoulos et al. (2015) further address harmonic space blending,
again at the chord, sequence or progression, and harmonic structure levels. In addi-
tion, the paper tackles scale-level blending, in which intervals from two different
scales are blended to give rise to a new scale or mode, something that happens fre-
quently in jazz. They also give examples of melody-harmony blending and provide
very interesting demonstrations showing The Beatles’ song Michelle and a Bach
chorale melody harmonized in different styles.
Moreover, Eppe et al. (2015) and Zacharakis et al. (2015) use a similar approach of
chord blending to invent novel, blended cadences, often giving rise to very interesting
substitution or transition chords. They also describe their evaluation method, which
will be commented on later in the evaluation section.
Chapter 3
Methods
In this chapter, the datasets used for training the models, their features, the preprocess-
ing and cleaning undertaken before use, and the digital representation of the data and
how it fits into a machine learning framework will be described. The model architec-
tures used will also be discussed and, for the conceptual blending part of the project,
the blending hypotheses for the models will be described.
3.1 Data
3.1.1 The Bach Model
For the first generative model, the chosen style, or genre, was Baroque music, more
specifically the style of Bach, the Baroque period's most prominent composer. For
this purpose, it was decided that the first of his famous suites for solo cello would be
used for training the machine learning model.
The first suite was chosen because it has the following desirable qualities that
would facilitate training and fit the model used, which will be described in a later
section of this chapter:
• It consists of six pieces, all of which are in the key of G major. This uniformity is
crucial, as the model has no way of explicitly distinguishing the key of a piece.
As a result, unless the system is augmented with elaborate rules or a different
representation defining the concept of a harmonic key, the model is likely to be
confused by the different sets of notes in different pieces steering the training
in different directions and, consequently, might end up generating pieces with
random accidental notes that do not fit any single key and sound dissonant and
unpleasant.
• The pieces are rhythmically simple and contain no complex durations or beat
divisions, a characteristic beneficial for the digital representation of music that
will be used in the model, discussed later.
• The cello suites are all solo pieces and, therefore, monophonic. This means that
only one line is written for a single instrument at any given time. There are a few
exceptions in which the cellist is expected to play a chord or a bass note with the
solo line, but these are very few and were discarded in the data cleaning process.
It should be noted that the representation used allows for polyphonic music to be
learned and generated, but since the models discussed are concerned only with
melody, which is a solo line, harmony and polyphony will not be addressed here.
• Widely regarded as among his greatest and most famous works, the cello suites
are Bach's canonical pieces that best capture his style and the Baroque genre in
general, while still being monophonic, as previously discussed. As a conse-
quence of their fame, the suites are readily available in MIDI format and in the
public domain, making them feasible to use for the purposes of this project.
3.1.2 The Traditional Folk Music Model
The style to be modeled by the second model was chosen to be traditional folk music
reels. The reel is a folk dance originating in Scotland that is also performed in the
British Isles and North America (Haigh, 2014).
For this purpose we will be using pieces extracted from the Nottingham Database1,
a database containing over 1000 British, American and Irish folk pieces, including
jigs, reels, hornpipes and waltzes. The first eighty-one pieces were chosen from the
reels in the database. These were chosen for reasons similar to the ones mentioned earlier
in the previous section.
The pieces, however, were not as uniform as the Bach cello suites and required
some preprocessing and cleaning, and the elimination of some pieces from the training
set. These steps are discussed in the following section.
1 https://ifdo.ca/~seymour/nottingham/nottingham.html
3.2 Pre-Processing and Data Cleaning
The music acquired was in MIDI format, a file format specifically designed for the
storage and transfer of music which makes it easy to modify and manipulate music
(Swift, 1997). A MIDI file contains messages, which carry events including pitch,
velocity and many other aspects of musical events as well as events describing music
notation like key and tempo (Miles Huber, 1991). How this format was used and the
relevant information extracted from the events to be converted into the representation
used in the model will be discussed in a later section dedicated to the topic.
The Bach pieces contained minimal irregularities since, as previously discussed,
they were uniform for the most part, with the exception of a few cases in which there
was a bass note to be played. The MIDI files provided2, however, already included
the occasional extra bass notes in a separate part (since MIDI also describes music
notation, it includes different parts to be written on separate staves), which made it
easy to extract only the part containing the main solo melody from the MIDI file. The
reels were not as uniform and therefore required more preprocessing and cleaning.
The folk reels contained many irregularities, which would disrupt the training pro-
cess and confuse the model, as explained earlier. Some of the irregularities could easily
be fixed by preprocessing with minimal change to the true piece, while others were
highly disruptive and could not be changed without drastically altering the piece. In
the case of the latter, it was decided to discard the pieces containing them, as there
were very few such pieces and the undesirable qualities were not characteristic of the
overall folk style we are trying to model.
The fixable irregularities consisted of two things, listed below:
• The pieces were in different keys, which is completely normal and expected
since, unlike the Bach pieces, they were not composed as an associated collec-
tion, but were composed over many years in many different geographic localities
and merely compiled in this dataset. This was fixed by transposing all the pieces
to the key of G major or, in the few cases where a piece was in a minor key, to
E minor, the relative minor of G major (which contains the same set of notes).
• Occasional irregularities in rhythm, such as a triplet (a beat divided
into three notes of equal duration). In this case, the triplet was converted into an
2 Retrieved from http://www.jsbach.net/midi/
eighth note followed by two sixteenth notes, as shown in Figure 3.1. Since a
triplet is an irregular rhythm that cannot be represented exactly (1/3 of a beat
is an infinitely repeating decimal, 0.333...), it cannot be expressed in the chosen
representation, in which equal division of the beats is crucial.
(a) Before ⇒ (b) After
Figure 3.1: Triplet preprocessing
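The two fixable preprocessing steps can be sketched in a few lines. This is an illustration only: the actual pipeline was implemented with Music21, so the helper names, the `(MIDI pitch, duration in beats)` note encoding and the tonic input are all assumptions made for the sketch.

```python
# Sketch of the two "fixable" preprocessing steps: transposition to
# G major and triplet quantization. Notes are modeled as
# (midi_pitch, duration_in_beats) pairs; names are illustrative.

G_TONIC = 67  # MIDI number for G4

def transpose_to_g(notes, tonic):
    """Shift every pitch by the interval from the piece's tonic to G,
    choosing the smaller direction (at most a tritone either way)."""
    shift = (G_TONIC - tonic) % 12
    if shift > 6:
        shift -= 12
    return [(pitch + shift, dur) for pitch, dur in notes]

def quantize_triplet(triplet):
    """Replace an eighth-note triplet (three 1/3-beat notes) with an
    eighth note (1/2 beat) followed by two sixteenths (1/4 beat each),
    keeping the total duration of one beat."""
    (p1, _), (p2, _), (p3, _) = triplet
    return [(p1, 0.5), (p2, 0.25), (p3, 0.25)]
```

Minor keys would be handled the same way with E (the relative minor tonic) as the target.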
The irregularities that were not fixable without crucially changing the piece oc-
curred in five pieces, which we chose to eliminate, bringing the total size of the train-
ing set from the initial eighty-one pieces down to seventy-six. The irregularities, along
with some examples, were:
• Polyphony. Even though they are supposed to be monophonic folk melodies, a
few pieces contained parallel lines at points, which were undesirable for the rea-
sons discussed in the previous section pertaining to the Bach suites. It was de-
cided to discard these pieces, as extracting one of the two lines would not be as
easy as in the Bach pieces and their removal would not have a significant effect
on the training.
• Irregular beats and note durations. Pieces containing strange beats and note
durations that could not be resolved as easily as the eighth-note triplet case were
discarded.
• Changes in time signature. One of the pieces contained a sudden, very unusual
time signature change to an odd signature that lasted for one bar, which also con-
tained an uneven beat that could not be captured in our representation.
• Changes in key signature. One of the pieces contained frequent changes in the
key signature back and forth throughout the piece before finally resolving back
to the original key. This would make it infeasible to restrict the piece to one key
even by transposition, as the two different harmonies would persist in the
intervals anyway. As this is certainly not a quality that features prominently in
this kind of music, this piece was removed.
(a) Polyphony; (b) odd durations and beats; (c) key signature change midway through
the piece; (d) time signature change
Figure 3.2: Data irregularities
3.3 Data Representation
The representation chosen for the model is a variation of piano-roll notation. Piano-
roll notation attempts to densely represent the MIDI digital score format (Mao et al.,
2018) by encoding the musical piece as a binary matrix in which the columns represent
the time steps and the rows represent the allowed note range.
The chosen resolution, or smallest possible time step, which fits both musical styles
(with the exception of the aforementioned anomalies), was a sixteenth note, which
represents a quarter of a beat. For example, for one standard 4/4 bar in a system that
allows 4 different notes, the matrix would have dimensions 4×16.
A problem that arises from this representation is that it is inherently ambiguous. A
note sustained for a certain duration is different from a note repeatedly played in each
of the time steps for that same duration, but both would be represented the same way
in this matrix. While other papers that used this representation resolved this problem
using an extra ‘replay’ matrix (Johnson, 2017a; Mao et al., 2018), due to the simpler,
monophonic nature of the melodies being learned and generated by this model, it was
decided to go with a simpler method of augmenting the matrix with a ‘sustain’ row,
in which a bit is switched on when the last note is sustained and off otherwise
(either when another note is being played or when there is a rest, indicating a time
step of silence). Figure 3.3 shows the difference between a long note, a repeatedly
played note and a short note played at the beginning of the beat followed by silence,
in a range of four allowed notes. The topmost row indicates the sustain bit.
(a) Sustained quarter note:

M = [ 0 1 1 1 ]
    [ 0 0 0 0 ]
    [ 1 0 0 0 ]
    [ 0 0 0 0 ]
    [ 0 0 0 0 ]

(b) Four repeated notes:

M = [ 0 0 0 0 ]
    [ 0 0 0 0 ]
    [ 1 1 1 1 ]
    [ 0 0 0 0 ]
    [ 0 0 0 0 ]

(c) One sixteenth note:

M = [ 0 0 0 0 ]
    [ 0 0 0 0 ]
    [ 1 0 0 0 ]
    [ 0 0 0 0 ]
    [ 0 0 0 0 ]
Figure 3.3: Different note representations
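The three cases of Figure 3.3 can be written out directly as matrices, for example in NumPy. This is a sketch: the real note range of the trained models spans far more than four rows.

```python
import numpy as np

# Row 0 is the sustain row; rows 1-4 are the four allowed notes;
# columns are the four sixteenth-note time steps of one beat.

sustained = np.array([[0, 1, 1, 1],   # sustain on after the attack
                      [0, 0, 0, 0],
                      [1, 0, 0, 0],   # single attack at step 0
                      [0, 0, 0, 0],
                      [0, 0, 0, 0]])

repeated = np.array([[0, 0, 0, 0],    # sustain off: four separate attacks
                     [0, 0, 0, 0],
                     [1, 1, 1, 1],
                     [0, 0, 0, 0],
                     [0, 0, 0, 0]])

short = np.array([[0, 0, 0, 0],       # one sixteenth note, then silence
                  [0, 0, 0, 0],
                  [1, 0, 0, 0],
                  [0, 0, 0, 0],
                  [0, 0, 0, 0]])
```

Without the sustain row, cases (a) and (b) would collapse to the same matrix, which is exactly the ambiguity the extra row resolves.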
3.4 Models
3.4.1 Recurrent Neural Networks and LSTMs
For the machine learning model used, a conventional dense-layer neural network archi-
tecture does not suffice, since music is a sequence of interconnected notes and recur-
ring themes that cannot be modeled as a static input fed into a fully connected neural
network for the corresponding output to be generated. This problem of modeling
sequences has been studied in depth in the field of machine learning, since many
things are sequential in nature, such as speech, motion or music (Boulanger-
Lewandowski et al., 2012). Music is also particularly complex to model due to the
long-term dependencies that occur in its sequences, which manifest in repeating
harmonic progressions, recapitulations of melodic motifs and the many repeats
typically involved in a piece of music.
Recurrent neural networks (RNNs) can theoretically solve this problem by keeping
track of the long-term history that occurred up to the most recent point in the sequence
and taking it into account during gradient back-propagation, incorporating this data
into the training process in which the network learns from data. This, however, is
problematic due to what is referred to as the vanishing gradient effect. Despite the
RNN's ability, in principle, to summarize an entire sequence history, Bengio et al.
(1994) showed that this is unachievable in practice using gradient-descent methods,
the standard optimization approach for machine learning algorithms, and, conse-
quently, not all of the sequence history is effectively incorporated into the process of
tuning the model parameters to successfully capture the given style or sequence.
To remedy this problem, many approaches have been suggested, most of which
are succinctly surveyed and summarized in Hochreiter and Schmidhuber's 1997 paper
‘Long Short-Term Memory,’ in which they introduce the titular variation on the con-
ventional RNN, which proved immensely successful at solving the problem. LSTMs
make use of memory cells and gated units instead of the conventional hidden units
used in normal RNNs. They have proven very successful at learning long-term
dependencies and repeated sequences, and were demonstrated on the task of musical
composition, successfully learning a blues structure and abiding by it (Eck and
Schmidhuber, 2002). Since then, LSTMs and variations on them have been used
almost exclusively in the field of algorithmic music composition using neural
networks, to great success and with interesting results. For these reasons, this project
uses the LSTM as the main functioning layer in the proposed deep learning model.
3.4.2 Hyper-parameters
The models used in this project for modeling Bach and folk music are identical
except for the input dimension, which differs slightly since the range of notes present
in the Bach pieces is slightly smaller than that of the folk reels. The hyper-parameters
of the models were chosen using conventional manual search and fine-tuning based
on the observed musical output, since the nature of the task and the training set does
not allow for a validation set to be used for determining hyper-parameters. The
hyper-parameters are listed below, along with the values chosen.
• Number of layers. In this model, there are six layers stacked in the following
order: input LSTM layer → dropout → LSTM → dropout → dense, or fully
connected, layer→ softmax activation layer.
• Number of hidden units. The first LSTM layer has 128 hidden units, while the
second has 64. These values, which are also relatively popular in the literature,
seemed to capture the dependencies best during training and generate the best
musical output.
• Dropout rate. For both dropout layers, the dropout value chosen is 0.5, as used
in Johnson (2017b) and Moon et al. (2015).
• Optimizer and its parameters. The training was optimized using RMSProp with
a learning rate of 0.001 and Nesterov momentum of 0.9, as used in Johnson
(2017b). These are the conventional default values for this algorithm and were
not changed.
• Batch size and number of epochs. 400 epochs of training with a batch size of 128
were used. The epoch number is deliberately overestimated, which has no impact
since early stopping was used. This will be elaborated on in the next chapter,
which describes the experiments.
Figure 3.4: Bach model architecture
3.4.3 Sampling the Next Time-step
During training, the final softmax layer learns the probability distribution of the next
note based on the previously observed sequence and, during prediction, transforms the
output of the final dense layer into a probability distribution P_nextnote using the softmax
function, which is a normalized exponential function (Bishop et al., 1995). For dense
layer output φ and k possible notes, the output of the softmax activation layer for note
i would be:
P_{nextnote=i} = exp(φ_i) / ∑_{j=1}^{k} exp(φ_j)    (3.1)
Given this distribution P_nextnote, we sample the next note, concatenate it to the
piano-roll sequence, move the history window one time-step ahead, then repeat the
process for the next prediction. Due to computational restrictions and the training cost,
the model only learns based on the previous 40 time-steps, which is two and a half bars
back in time.
It should be noted that, alongside the notes in the range, the probability distribution
P_nextnote also includes the probability of the sustain bit. If sampled, the sustain bit
means that no new note is to be played and the last note is elongated by one sixteenth-
note duration. The functionality of sampling a rest (a beat of silence in which no note
is played and the sustain bit is also off, which cuts the previous note short and results
in a time step of silence) is not included in the model, as the training data contained
rests only on very rare occasions and it was decided that they were not indicative of
the style for practical purposes.
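The sampling loop described above can be sketched as follows. The `predict` function stands in for the trained network's dense-layer output and is an assumption of the sketch, as is the fixed class count; one of the k classes can be understood as the sustain bit.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(phi):
    """Equation 3.1: normalized exponential of the dense-layer output."""
    e = np.exp(phi - phi.max())  # subtract max for numerical stability
    return e / e.sum()

def generate(predict, seed, n_steps, window=40):
    """Autoregressive sampling: `predict` is a stand-in for the trained
    network, mapping a (window, k) history to the dense output phi."""
    seq = list(seed)
    for _ in range(n_steps):
        history = np.array(seq[-window:])
        p = softmax(predict(history))    # P_nextnote over the k classes
        i = rng.choice(len(p), p=p)      # sample rather than argmax
        step = np.zeros(len(p))
        step[i] = 1.0                    # one-hot next time-step
        seq.append(step)
    return np.array(seq)
```

Sampling from `p` instead of taking the argmax is the stochastic choice discussed in the experiments chapter.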
3.4.4 Conceptual Blending
Given the two generative models, the next step was to investigate different methods
of applying conceptual blending to them. This is not a trivial task, as the theoretical
concept of combinational creativity, which humans exercise seemingly trivially and on
a daily basis, is difficult to recreate computationally (CoInvent, 2017). To tackle this
task, two different approaches to combining the models were hypothesized, one of
which proved more successful than the other.
The first proposed method of combining the models was intuitive. A seed would
be given to one of the two models and its predictions sampled to generate two
measures of music (32 time steps); then the active model would be switched and
the other model would generate two measures, and so on. This method of alternating
between the models to extrapolate the melody seems naive at first glance but,
considering that the history window upon which each prediction is made is 40
time-steps (two and a half measures), its plausibility becomes clearer.
Since the prediction is based on a window longer than the sequence generated by
the previous model (32 time-steps), the output of the previous model, as well as the
output of the current model from its last turn, is taken into account when predicting
the sequence of notes, rather than simply extrapolating the previous model's output.
As a result, the predicted sequences are increasingly based on a mixture of both
models and, in a cumulative process, the blends improve as the sequence length
increases.
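A minimal sketch of this alternating scheme, assuming each model is wrapped as a function that maps a history window to one sampled one-hot time-step (the wrappers and dimensions are illustrative, not the project's actual interfaces):

```python
import numpy as np

def alternating_blend(models, seed, n_bars, steps_per_bar=16, window=40):
    """Hand control back and forth between the two samplers every two
    measures (32 time steps). Each element of `models` is a stand-in
    mapping a history window to one sampled one-hot time-step."""
    seq = list(seed)
    active = 0
    for _ in range(n_bars // 2):
        model = models[active]
        for _ in range(2 * steps_per_bar):
            seq.append(model(np.array(seq[-window:])))
        active = 1 - active  # switch the active model
    return np.array(seq)
```

Because the 40-step window exceeds the 32-step handover, each model always conditions on some of the other's output, which is the cumulative mixing effect described above.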
The second model is theoretically driven, from a probabilistic viewpoint, but did
not prove very successful in practice. It makes use of the probability distributions gen-
erated by the predictive models and attempts to combine them to generate a blended
probability distribution that captures the predictions of both models. Starting with a
seed, the two models are fed the history sequences in parallel; the generated proba-
bility distributions over the notes are then added or multiplied (both methods were
attempted) and normalized, after which a note is sampled from this joint distribution
and the process repeated.
As previously stated, the input and output lengths of the two models are not equal,
since the range of allowed notes of a model is determined by the highest and lowest
notes in its training corpus. The difference in range was not significant, as the folk
music model's range was only 3 notes larger than the Bach model's. This discrepancy
in range and, consequently, in input/output vector length was resolved, depending on
the situation, as shown below.
• When feeding the Bach output to the Folk model, the length of the vector was
increased to fit by padding the note indices not present in the Bach model with
zeros before feeding the sequence to the Folk model.
• In the inverse case, where the Bach model expects a shorter input length, the
notes outside of its range were clipped off and only the notes in the range it
learned were fed to it. This is a shortcoming of the model, since some informa-
tion is omitted, but the notes outside the range are assumed to be extreme
pitches (too high or too low) that would appear very infrequently, if at all, in the
Folk model's predictions in the first place. The sporadic, very sparse cases
where they are present are assumed not to greatly affect the Bach model's
prediction, as such a note is likely to occupy at most one of the 40 previous
time-steps the prediction is based on.
• When adding or multiplying the two predicted distributions, the prediction
of the shorter output (Bach) was padded with zeros, indicating a zero chance of
the notes outside its training range being played. This gives the unshared notes a
probability of zero when multiplying the distributions and, when adding, a low
probability of being played relative to the other notes, whose probabilities are a
sum of two numbers as opposed to a sum of one probability and zero.
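The padding and combination steps can be sketched as follows. The `offset` parameter, which locates the Bach range inside the Folk range, and the fallback when multiplication zeroes out every note are assumptions of this sketch, not details specified in the text.

```python
import numpy as np

def blend_distributions(p_bach, p_folk, offset, mode="multiply"):
    """Pad the shorter Bach distribution with zeros to cover the Folk
    note range, combine the two, and renormalize."""
    padded = np.zeros_like(p_folk)
    padded[offset:offset + len(p_bach)] = p_bach
    joint = padded * p_folk if mode == "multiply" else padded + p_folk
    total = joint.sum()
    if total == 0:       # multiplication can zero out every note
        joint, total = p_folk, p_folk.sum()
    return joint / total
```

Under multiplication the unshared notes get probability exactly zero; under addition they merely become relatively unlikely, as described above.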
Chapter 4
Experiments
In this chapter, the details of running the experiments and training the generative mod-
els will be discussed as well as some design decisions that were attempted while build-
ing the music generation system but were discarded as the design and implementation
evolved.
While the chapter divide at this point is seemingly arbitrary and the information
presented herein could be incorporated, based on its nature, into the methodology and
evaluation chapters, it seemed more fitting, and convenient for reference, to enclose
all information pertaining to the direct running of the experiments and the learning
process and model performance in a separate chapter, as these are distinct from the
model architecture and design aspects. Similarly, it was decided that the different
design approaches and model iterations, their shortcomings and the reasons they were
discarded fall under the category of ‘experimentation’ during the system design and,
consequently, were included in this chapter. These unsuccessful iterations are only
included for thoroughness and for the intermediate stages of the project and evolution
of the model to be reported and for their shortcomings to be noted and documented for
future research.
4.1 Loss Function
Since the output of the network is supposed to model a probability distribution, the
loss function used for training the model is the categorical cross-entropy error. Cross
entropy measures the discrepancy between a predicted probability distribution q and a
true, or target, probability distribution p. ‘Categorical’ cross entropy is the name given
to a special cross entropy loss function available in the Keras deep learning library
(Chollet et al., 2015), used when the target distribution is one-hot encoded, meaning
one bit, or ‘class,’ is switched on and the rest are set to zero, which is true of our
representation (refer to section 3.3). The cross entropy for discrete p and q is defined
as shown in equation 4.1.
H(p, q) = −∑_x p(x) log q(x)    (4.1)
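For a one-hot target p, equation 4.1 collapses to the negative log-probability of the true class, which a short sketch makes concrete (the small epsilon guarding against log 0 is a standard numerical precaution, not part of the definition):

```python
import numpy as np

def categorical_cross_entropy(p, q, eps=1e-12):
    """Equation 4.1: -sum_x p(x) log q(x); eps avoids log(0)."""
    return -np.sum(p * np.log(q + eps))

p = np.array([0.0, 1.0, 0.0, 0.0])  # one-hot target time-step
q = np.array([0.1, 0.7, 0.1, 0.1])  # predicted softmax distribution
loss = categorical_cross_entropy(p, q)  # equals -log(0.7), about 0.357
```

Only the term for the true class survives the sum, so minimizing this loss pushes the predicted probability of the observed next note toward 1.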
4.2 Training the Models
The model architecture described was implemented using Keras and Tensorflow and
the training ran for 400 epochs with a batch size of 128 and early stopping with a
patience of 5. This patience value means that training is stopped after five consecutive
epochs with no improvement in the loss, indicating that training is no longer
sufficiently effective.
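The training itself used Keras's built-in early-stopping callback; the patience rule it applies to the monitored loss can be sketched in isolation as follows (an illustration of the stopping logic, not the Keras implementation):

```python
def should_stop(loss_history, patience=5):
    """Stop once the `patience` most recent epochs all failed to
    improve on the best loss recorded before them."""
    if len(loss_history) <= patience:
        return False
    best_before = min(loss_history[:-patience])
    return all(loss >= best_before for loss in loss_history[-patience:])
```

With a patience of 5, five stagnant epochs in a row end the run, which is how the Folk and Bach models stopped at epochs 159 and 177 despite the nominal 400-epoch budget.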
After running the training scripts with the early stopping callback feature, the Folk
model stopped at epoch 159 with a final loss of 0.1878 while the Bach model stopped at
epoch 177 with a final training loss of 0.3690, as shown in the plots below, illustrating
training loss against epoch. This difference could be due to the small Bach training set
compared to the traditional tunes (see section 3.1) or the complexity and variation of
Bach’s music as opposed to the traditional folk tunes, which are mostly simple melodic
lines following basic harmonies, with few accidental notes and little variation in any
given piece. These factors likely make Bach's music harder to model and predict.
This loss serves well for training and comparing models but, while indicative of how
well the learned probability distributions match the original style, it is hardly sufficient
for evaluating the pleasantness of a musical output. For this reason, the loss values
are only mentioned here while describing the training and are not analyzed further in
the later chapter on evaluating the output.
(a) Folk model
(b) Bach model
Figure 4.1: Model training losses against epochs
4.3 Intermediate Stages and Experimentation
During the design process, and over time, before arriving at the final design decisions
and architecture described in the previous chapter, different approaches were imple-
mented and tested with varying levels of success and failure. These are listed and
described below.
• Different model parameters. Before arriving at the architecture described, dif-
ferent parameters of the neural network were tested, specifically relating to the
number of hidden units in the LSTM layers. The first LSTM layer initially had
64 recurrent units (matching the second layer), as opposed to 128 in the final
model, but this configuration did not achieve interesting or pleasant results that
properly emulated the training music. This is possibly because the fewer the
units, the weaker the representational power of the network; therefore, the
earlier architectures with smaller hidden layers could not capture the complex
nature of the musical styles and the relationships between the notes.
Another parameter varied in earlier stages was the dropout rate in the two
dropout layers. Dropout rates of 0.2 and 0.4 were tested but, upon examining
the output, the network seemed more prone to overfitting its weights to the
training music, resulting in many musical phrases exactly copying segments of
the training data. This is not the desired behavior of the system, as the goal is
to model the styles as closely as possible without copying the music exactly.
• Different data representation. The initial data representation used was much
simpler, and arguably significantly more primitive and naive: the music was
represented as a series of vectors of length 2, with the first value being the MIDI
integer value for the pitch and the second a float representing the duration of the
note (a quarter beat being 0.25, a whole beat 1.0, etc.). This resulted in very
messy output, not faithful to the correct key of the training pieces, containing
mostly accidental notes outside the correct key and harmony and sounding very
chaotic and unpleasant. This is because the final layer is a dense layer
outputting an approximated value as close as possible to the pitch value
believed to be true (as opposed to the softmax layer giving a probability distri-
bution over the note classes), as in a regression task, which is usually much less
accurate than the classification task implied by the one-hot encoding of the piano
roll notation, where the different possible notes in the next time-step are mod-
eled as different classes. As a result, an explicit rule-based system had to be
implemented through which the output is passed and the pitch values are
rounded to the closest pitch in the scale. This is undesirable and inaccurate, as
it forces the whole outputted piece to fit the seven notes of the scale, while the
original pieces did contain accidental notes from outside the main tonality of
the key, used properly and in a calculated fashion without sounding dissonant
or out of place.
Moreover, the predicted duration values were mostly impossible fractions lying
between the normal subdivisions of the music and were similarly forced to be
rounded to the nearest quarter of a beat, which also adds inaccuracy and unfaith-
fulness to the style learned and outputted by the neural network. This
data representation also does not allow for a probabilistic sampling of the next
time-step, since the output is deterministic and does not allow for uncertainty
or randomness in the sequence generated, the shortcomings of which will be
discussed in the next point.
• Different time-step determination method. Even after deciding on the piano-roll
and softmax implementation, the initial method for determining the next note
was deterministic: given the softmax probability distribution, the note with the
highest probability was automatically selected and added to the generated se-
quence, resulting in the maximum-likelihood piece being generated. This pro-
duced uninteresting results, with the same musical output generated at every run
of the models. The lack of stochasticity also resulted in some overfitting, since
the most probable notes were always picked and, consequently, some segments
of the music exactly copied segments from the training data. Observing this, it
was decided to make use of the predicted distribution by sampling from it,
adding some randomness to the decision and resulting in more interesting output
that escapes such copied passages after a short segment of notes, since the
stochasticity decreases the probability of reproducing a long training sequence
verbatim.
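The difference between the discarded deterministic rule and the final stochastic one is a one-line change, sketched below with an illustrative distribution:

```python
import numpy as np

rng = np.random.default_rng(1)
p = np.array([0.05, 0.60, 0.20, 0.10, 0.05])  # illustrative softmax output

# Early design: always pick the modal note, so every run reproduces
# the same maximum-likelihood piece (and can copy the training data).
greedy_note = int(np.argmax(p))

# Final design: sample from the distribution, so runs diverge and long
# verbatim copies of the training set become increasingly unlikely.
sampled_note = int(rng.choice(len(p), p=p))
```

Over a sequence of n steps, the chance of the sampler reproducing the greedy path shrinks roughly as the product of the modal probabilities, which is why copied passages stay short.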
Chapter 5
Evaluation
Having trained the models and achieved acceptable results (given the time constraint)
and, subsequently, tested the hypothesized conceptual blending models, the next step
was to evaluate the outcomes of the models in a way that conveys with some degree
of certainty how well the models perform, how acceptable the generated music is and
how well it captures the given musical style. Given the nature of the topic and the
data at hand, this is far from an easy feat, as previously stated in the project proposal
and as reiterated and elaborated on below.
5.1 On Evaluating Music Composition Models
A glaring, immediately observable complication on surveying the field of algorith-
mic composition, or computational creativity in general, is the lack of agreed-upon,
objective evaluation methods by which to compare the generated outputs and, con-
sequently, the models used to generate them. As is clear from surveys of the field,
many papers choose to use human experts to evaluate the generated music, which is
not unreasonable, considering music is made primarily to be enjoyed by humans and
not to be scrutinized by a rigid mathematical or probabilistic formula. Other papers,
such as Cherla et al. (2015), use cross entropy to compare the pieces probabilistically,
which is not necessarily indicative of aesthetic pleasantness but allows for objective
comparison of different models, as is done for the models proposed in this work.
Some experiments use a combination of methods for evaluation, like the two-step
evaluation described in Adiloglu and Alpaslan's (2007) experiment, where, in the first
step, they use the rules of counterpoint to assess the pieces based on how many rules
were violated. Others use methods from natural language processing, like the overall
probability of a given generated piece (similar to the overall probability of a sentence).
However, it remains clear that this is an open area of research, as well as of debate as
to whether it is fitting to come up with one objective measure of evaluation or to leave
it to humans to evaluate the aesthetics of generated music.
Mozer (1994) discusses this point as it pertains to neural networks in particular,
given that ANN systems are mostly discussed informally and with no strict critical
measures, possibly because their outputs are presumed to be successful by definition.
Nierhaus (2009) comments on this, restating the previously mentioned claim that
“this criticism points out a common lack in a number of publications in the field of
algorithmic composition.” This is a significant gap in the field that needs further
discussion and examination.
5.2 On Evaluating Conceptual Blending Models
This part of the project builds primarily on the fields of computational creativity
and music, but as it pertains to blending concepts and the resulting output of a model
that successfully does so, it significantly complicates the already difficult problem
of evaluation in algorithmic composition. Moreover, since the outputs of two models
are blended by a third model, it would be unrealistic to evaluate the combined output
using the metrics of either constituent model alone. Given this complication, and the
problem's entanglement with human aesthetics and subjective disposition, it seems
reasonable to have humans judge the outcome of the blending model: whether it
follows the rules of both styles being blended and, consequently, sounds like the two
styles simultaneously.
Zacharakis et al. (2015; 2017) address this problem of evaluating the output of a
conceptual blending musical system and use a similar approach for empirical evalua-
tion of the musical cadence blendoids. They used musically trained participants in two
different tests. The first was a “nonverbal dissimilarity rating listening test” and the
other was a verbal descriptive test that was more subjective and more useful “to assess
the qualities of the produced blends.”
Further to the problem of evaluating conceptually blended models, it is not hard
to imagine that, regardless of the application domain in question, it is a formidable
mental exercise to devise a way of evaluating what constitutes a ‘good’ blend. To
demonstrate this, the canonical example mentioned earlier (see section 2.5.1) of the
‘boat’ and ‘house’ conceptual spaces and their many possible blends, as discussed
in Goguen (2006), can be examined.
Given these two spaces and some (hypothetical) blends of them, as perceived by
different individuals, what, if any, would constitute an objectively better blend than
others? Would a house capable of floating on water be more acceptable than a boat that
also functions as a living space? Moreover, to pose an even more naive, yet still
valid, reflection on the topic: would a yacht, fully equipped with all the luxuries
needed to be perfectly livable, and indeed functioning as the living space for some
hypothetical hermit, count as an acceptable blend, or merely as a variant of a boat
with no ‘house’ aspect to it, and therefore not a ‘blendoid?’
Upon examining the previous questions and the clearly subjective nature of the an-
swers, it could be reasonably concluded that attempting to find an objective evaluation
of a blend is unrealistic, and, at least in the creative application domains, a subjective
participant evaluation, like the one described above in Zacharakis et al. (2015; 2017),
is more suitable. The empirical participant evaluation will be the main evaluation used
to measure the model’s success.
The results of the evaluations of the models are presented below.
5.3 Analyzing Musical Outputs Against Training Data
Despite having concluded that, for a creative output, a subjective evaluation is best,
it was also appropriate to attempt to gain some quantitative insight on how well the
models captured the style and patterns present in the training data. The two models
achieved different levels of success, for reasons that will be speculated upon and
explained where possible. The results of this analysis are presented below.
Observing the histograms of the pitch class frequencies shown below, a few defin-
ing characteristics can be readily observed. In both Bach and traditional music, and as
expected from basic music theory, the most frequent pitch is G, which is the central
tonality of the G major key. The notes that follow in frequency after that are, also as
expected, D and B, which, along with G make up the G-major tonic triad. After that,
the next most frequent notes are A and F#, which are part of the dominant triad (D,
F# and A), followed by C and E, which are part of the sub-dominant triad (C, E and
G). One exception is that Bach tends to use pitch class A very frequently, even more
than D and B, the two notes that, with G, make up the tonic triad, which is unusual
(but better left to musicologists to analyze). Moreover, a few outside notes are used
infrequently in the pieces, namely C#, Eb, F, G# and Bb; these are pitches not in the
main scale, used occasionally to make the music more interesting and unexpected.
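Pitch-class frequency counts like those in the histograms described here are straightforward to derive from MIDI note numbers, since a pitch class is simply the note number modulo 12. A small self-contained sketch, separate from the project's music21-based analysis pipeline:

```python
from collections import Counter

# Pitch-class spelling chosen to match the note names used in the discussion.
PITCH_NAMES = ['C', 'C#', 'D', 'Eb', 'E', 'F', 'F#', 'G', 'G#', 'A', 'Bb', 'B']

def pitch_class_histogram(midi_notes):
    """Count how often each of the 12 pitch classes occurs in a note sequence."""
    counts = Counter(PITCH_NAMES[n % 12] for n in midi_notes)
    return {name: counts.get(name, 0) for name in PITCH_NAMES}

# Hypothetical G-major fragment (67 = G4, 71 = B4, 74 = D5, 69 = A4, 66 = F#4).
hist = pitch_class_histogram([67, 71, 74, 67, 69, 66, 67])  # G is most frequent
```

Comparing such a histogram for a model's output against one for its training corpus gives exactly the kind of key-adherence check performed above.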
Observing the pitch class frequencies in the models’ generated outputs, it is clear
that the folk model learned the correct key of the pieces and managed to generate
music with almost exactly the same note frequencies as the original style, without
any explicit rules to use notes from that key. The Bach model, on the other hand,
used many of the notes successfully, but with some anomalies and frequencies not
present in the original Bach suites: its most used notes are D and C, as opposed to
the tonic G, and it used the outside notes Eb and F more frequently than their in-key
counterparts E and F#, respectively, in violation of tonal music theory and the style
of the Bach suites. This is most likely because the training dataset is too small (see
section 3.1), a well-known problem in machine learning, coupled with the fact that
Bach’s music is very complex and its dependencies and patterns cannot be easily
learned.
Another pattern that becomes clear upon observing the weighted scatter diagrams
(which were produced after excluding triplets and complex durations that would be
removed in preprocessing) of the music is Bach’s persistent use of 16th notes (quarter
beats) and, less frequently, eighth notes. The folk pieces, on the other hand, use mainly
eighth and quarter notes, which is not unexpected given that they are usually simple
melodies that can be sung and played by relatively amateur musicians. The two
models did not manage to capture the note durations very accurately, as can be seen
in their scatter diagrams: the music of both models tends to settle on one note
duration and favor it almost exclusively.
The note duration problem, intuitively, could be due to the data representation used
(see section 3.3), which results in the sustain bit being on too frequently during training
and is therefore given a lot of weight and very high probability during prediction.
However, while this explains the long sustained notes, it does not explain the emphasis
on a certain time duration for each model (half beat and two beats in the Bach and folk
models, respectively). It should also be noted that, at every run of the program (each
outputting different predictions), the two models would consistently produce a
segment of interesting music with varying durations more in line with those observed
in the training pieces before getting stuck generating these relatively long notes.
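The sustain-heavy training signal described above can be illustrated with a toy encoder: each note contributes one note-on step and then sustain steps for the rest of its duration, so even moderate note lengths make the sustain event the dominant prediction target. This encoder is a simplified, hypothetical stand-in for the representation of section 3.3:

```python
def encode_with_sustain(notes, steps_per_beat=4):
    """Encode (pitch, duration_in_beats) pairs as per-timestep events:
    a note-on event at the first step, then 'sustain' events until the note ends."""
    events = []
    for pitch, beats in notes:
        steps = int(beats * steps_per_beat)
        events.append(('on', pitch))
        events.extend([('sustain', pitch)] * (steps - 1))
    return events

# A hypothetical bar of four quarter notes at 16th-note resolution.
seq = encode_with_sustain([(67, 1.0), (69, 1.0), (71, 1.0), (74, 1.0)])
sustain_fraction = sum(1 for kind, _ in seq if kind == 'sustain') / len(seq)  # 0.75
```

Even in this simple bar, three quarters of the training targets are sustain events, which is consistent with the models' tendency to hold notes far too long.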
Worth mentioning is the fact that this duration difference from the original music
is not particularly relevant or detrimental to the actual use of notes to model the
style, since this is an issue of tempo that can be manipulated after the output is
acquired (e.g. two half notes at a certain tempo sound the same as two quarter notes
at half the tempo, just notated differently). After being sped up, the pieces sounded
pleasant and relatively similar to the styles modeled. This, however, is an issue still
worth examining, as note durations and tempo are important characteristics of music
styles and, consequently, of music style modeling, and they would not have been
disregarded were it not for the time constraint and the fact that this project is
primarily concerned with melodic style as opposed to beat and rhythm.
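The tempo equivalence argued here follows directly from the fact that a note's real-time length is its beat count times the seconds per beat:

```python
def note_seconds(beats, bpm):
    """Real-time length in seconds of a note lasting `beats` beats at tempo `bpm`."""
    return beats * 60.0 / bpm

# A half note at 120 BPM occupies the same time as a quarter note at 60 BPM.
half_at_120 = note_seconds(2, 120)
quarter_at_60 = note_seconds(1, 60)
```

So the overly long durations the models emit can be compensated for by playback tempo without altering the pitch content that the style analysis focuses on.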
The next section will present the results of the subjective empirical evaluation by
the participants regarding the two music style models as well as one of the two
conceptual blending models proposed, namely the one in which the two models
alternate in predicting the measures. The output of the other blending model, in
which the two generative models predict the next time-step in parallel and the two
probability distributions are combined (either by addition or multiplication),
normalized, then sampled from, was deemed unsuitable for presentation or further
examination, as it mostly comprised very long notes sustained over many bars and
very rarely changing. It is not difficult to conclude that this is due to the
aforementioned already high probability of sustaining a note: when the two
distributions are added together, the sustain event receives an even bigger weight
and probability once the new blended distribution is normalized. The repeating note
pitch can be explained by the same reason, as the notes more central to the key have
high probabilities in both distributions, resulting in a very high probability in the
blended probability distribution.
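Both blending schemes discussed in this chapter can be sketched in a few lines; the functions below are illustrative stand-ins for the actual implementation, using toy distributions and dummy measure generators rather than LSTM outputs:

```python
import random

def blend_and_sample(p_a, p_b, rng=random.Random(0)):
    """Parallel blend: add the two models' next-step distributions,
    renormalize, then sample a single event from the result."""
    combined = [a + b for a, b in zip(p_a, p_b)]
    total = sum(combined)
    combined = [c / total for c in combined]
    event = rng.choices(range(len(combined)), weights=combined)[0]
    return event, combined

def alternate_measures(model_a, model_b, n_measures):
    """Alternating blend: even-numbered measures come from one model,
    odd-numbered measures from the other."""
    return [(model_a if i % 2 == 0 else model_b)(i) for i in range(n_measures)]

# Toy distributions over four events; both put most mass on event 1
# (analogous to the sustain bit), so the additive blend reinforces it further.
event, blended = blend_and_sample([0.1, 0.6, 0.2, 0.1], [0.2, 0.5, 0.2, 0.1])
measures = alternate_measures(lambda i: f'bach_{i}', lambda i: f'folk_{i}', 4)
```

The reinforcement of the already dominant sustain event in the additive blend is exactly the failure mode described above; the alternating scheme sidesteps it by never mixing the two distributions directly.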
(a) Original Bach pieces (b) Original Folk pieces
(c) Bach model output (d) Folk model output
Figure 5.1: Pitch class frequency histograms
(a) Original Bach pieces (b) Original Folk pieces
(c) Bach model output (d) Folk model output
Figure 5.2: Pitches and note lengths weighted scatter diagrams
5.4 Empirical Evaluation
For the subjective evaluation by a human audience, 11 participants (fewer than
originally planned, as time did not permit recruiting more) were given questionnaires,
had the music played to them, and answered questions about it. The questions
included rating the similarity of the music, or how well they thought the AI model
captured the style, on a conventional five-point scale, where 1 through 5 indicate
very poor, poor, average, good, and excellent, respectively. As for the conceptual
blending model, a different five-point scale was given, shown below, and the
participants were asked to pick the number corresponding to the style they felt the
blended music represented and how fairly it represented the two given genres.
1. Exclusively folk
2. Predominantly folk, only incidentally similar to Bach
3. Equally folk and Bach style
4. Predominantly Bach, only incidentally similar to folk reels
5. Exclusively Bach
The results were generally favorable and in line with what has been observed up
to this point. The participants’ average ratings for the Bach and folk models were
3.27 and 3.55, respectively, indicating slightly above-average success of the models.
The results, shown below, also reveal the lack of a unique mode for the folk model
and, despite a mode of 3 for the Bach model, a notable spread of opinions across the
spectrum. This points further to the subjectivity of music and the lack of uniformity
in perceiving it, making it difficult to assess such a system of algorithmic
composition with anything other than a large sample of participants.
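The summary statistics reported here (mean and mode of five-point ratings) are standard; the ratings list below is illustrative, chosen only to reproduce the reported 3.27 average, and is not the actual survey data:

```python
from statistics import mean, multimode

# Illustrative ratings for 11 participants on the five-point scale (not real data).
ratings = [3, 4, 3, 2, 5, 3, 4, 2, 4, 3, 3]
avg = mean(ratings)         # ~3.27
modes = multimode(ratings)  # [3]
```

With a sample this small, a single changed rating moves the mean by almost 0.1, which underlines the need for a larger participant pool.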
Figure 5.3: Participant rating of the style models’ accuracy showing notable spread of
opinions
The conceptual blending model similarly achieved very favorable results, displayed
below as a continuous bell curve to better visualize the interesting polarity of the par-
ticipant opinions. The average rating on the presented music style scale is 3.18, in-
dicating what could be perceived as a very successful, almost equal blend of the two
music styles. The slight skew to the right of the curve is possibly due to the relatively
fast tempo the blended output was played in, which made it sound similar to Bach’s
music, as one participant commented in an optional “open comment” question at the
end of the questionnaire that “the music is too fast to be folk.”
These results demonstrate the success of the conceptual blending model, which is
likely to be equally successful in other, similar application domains in which a
blend of two probabilistic history-based sequence prediction models is desired.
However, the factors that could introduce inaccuracies during evaluation should also
be noted and taken into account when evaluating other models. In this project, these
factors included the tempo differences in the music (which are in no way learned or
decided by the model), which could affect the opinion of participants with little
musical background. The lack of musical background is itself another factor, as one
participant with little musical background anecdotally and offhandedly remarked
that he was merely guessing when asked which piece he believed was the original
and which was generated by the machine learning model.
Figure 5.4: Participant rating of the blending model showing polarization of opinion
Chapter 6
Conclusion
To fully comprehend the aim and scope of the project and put it in context, it is im-
portant to understand the motivation and train of thought that led to it, including the
series of events through which the project went before crystallizing in the form pre-
sented herein. This section will attempt to clarify this and, hopefully, shed light on
some of the research that has been done as well as suggest future work and directions
that would have been interesting to pursue if time permitted.
6.1 Context and Motivation
The project was motivated by the field of conceptual blending and the recent research in
computationally modeling the cognitive process of blending two concepts to achieve a
novel output. More specifically, the aim was to build on Project Coinvent (2017) which
aimed to “develop a computationally feasible, cognitively-inspired formal model of
concept creation, drawing on Fauconnier and Turner’s theory of conceptual blending”
and included research on the conceptual blending of musical models, cited earlier
in this report (see chapter 2).
While the main aim was to expand the body of research on computational concep-
tual blending of music models, a fundamental part of that was the original constituent
models to be blended. The initial plan was to acquire two models presented by other
researchers from the vast body of work on probabilistic music style modeling.
However, for reasons including outdated code repositories, non-responsive authors
and incomplete, shallow access to models (e.g. access only to the final output),
which does not allow custom manipulation of the predictions, this plan fell through
and no suitable algorithmic composition models were acquired.
Stemming from this, it was decided that the project be augmented with a first stage,
in which two probabilistic music style models were designed and implemented, to
be used in the second stage, in which the computational model for conceptual
blending was presented and tested. This sudden change in the scope of the project,
and the consequent tightening of the time restrictions, was a significant challenge
and stood in the way of many possibilities worth pursuing, which are discussed later.
6.2 Summary and Future Work
A deep learning model architecture was presented and implemented that was success-
ful in capturing the two styles chosen for the task. Participant surveys showed a general
satisfaction with the extent to which the styles were modeled and a tendency towards
a slightly above average rating of the models. There were a few shortcomings of the
models that were discussed and speculated upon and could be focused on further in the
future.
Having successfully acquired the two models, two conceptual blending models were
proposed, one of which did not achieve results acceptable enough to proceed to the
participant evaluation and was therefore discarded. The other proposed model
achieved very good results, and participant evaluation showed that, on average, the
model successfully blends the two constituent musical styles into a new output that
captures characteristics of both. It is presumed that this proposed computational
model for conceptual blending would prove equally successful in blending other
sequence prediction models, as it is agnostic to the application domain and does not
use specialized musical knowledge or rules in any way.
From this concluding note, below are some possible directions for future research
that could have been pursued if it weren’t for the time constraint.
• An intuitive extrapolation of this research would be to test the proposed conceptual
blending approach on other models and application domains and examine the output
to see whether it achieves equally good results. This could be done either with
similar models that extrapolate a long history sequence by predicting the next step,
or with simpler Markov models that take only a very short history into consideration,
applying the concept of alternating different predictive models to them.
• The music style models can be augmented in many different ways to achieve
better results. The literature on this is extensive and covers many aspects of music,
but specifically improving the model proposed here would involve a better music
representation, one more suitable for capturing durations and allowing the rhythm
to be learned, as opposed to the one used here, which is problematic for the reasons
explained in chapter 5.
• As previously stated, the literature on evaluating both algorithmic composition
and conceptual blending models is fraught with obstacles and issues that make it
very difficult to reach a method of objective evaluation through which different
models can be compared. More research certainly needs to be done to fill this
gap and, hopefully, get a clearer, more quantitative metric to compare against.
• More obviously, more research on computational conceptual blending needs
to be undertaken, whether that is research in different domains where concep-
tual blends could be of use, or on different computational models of conceptual
blending that can successfully generate a novel, creative output given two differ-
ent conceptual spaces.
Appendix A
Participant survey
1. Do you have any musical background? Please rank your knowledge of music
on a scale of 1 to 5, with 1 indicating no musical background and 5 indicating
full professional proficiency of a musical instrument(or vocal performance) and
music theory.
O 1 O 2 O 3 O 4 O 5
2. You will now listen to two pieces of music, one of which is a traditional reel (dance)
and another composed by an artificial intelligence agent trained on traditional British
and American folk reels. After listening to both pieces, please indicate which one
you believe is a real traditional tune and which is composed by the AI (please use
“1” to indicate the first of the two pieces you just listened to, and “2” to indicate
the second).
traditional .........
AI .........
3. To what extent would you say the AI succeeded in modeling the given musical
style (you may think of this rephrased as ‘how similar is the style of the two
pieces’)? Please give a ranking on a scale of 1 to 5, with 1 indicating poor per-
formance or “no similarity in style” and 5 indicating excellent performance or
“identical style” between the two compositions.
O 1 O 2 O 3 O 4 O 5
41
4. You will now listen to two pieces of music, one of which is composed by Bach
and another by an artificial intelligence agent trained on pieces written by Bach
and posing as him. After listening to both pieces, please indicate which one you
believe is composed by Bach and which one you believe is by the AI (please use
“1” to indicate the first of the two pieces you just listened to, and “2” to indicate
the second).
Bach .........
AI .........
5. To what extent would you say the AI succeeded in modeling the given musical
style (you may think of this rephrased as ’how similar is the style of the two
pieces’)? Please give a ranking on a scale of 1 to 5, with 1 indicating poor per-
formance or “no similarity in style” and 5 indicating excellent performance or
“identical style” between the two compositions.
O 1 O 2 O 3 O 4 O 5
6. You will now listen to a piece that attempts to combine aspects of the two pre-
vious music styles. Please characterise the style of the final piece using the
following scale:
O Exclusively folk
O Predominantly folk, only incidentally similar to Bach
O Equally folk and Bach style
O Predominantly Bach, only incidentally similar to folk reels
O Exclusively Bach
If you have any further comments on the experiment, algorithmic composition,
computer creativity, the output of the AI composers in relation to the original
pieces they learned from or insights on how it decided to blend the two styles
into one piece, please elaborate below.
...............................................................................................................
...............................................................................................................
...............................................................................................................
...............................................................................................................
...............................................................................................................
...............................................................................................................
...............................................................................................................
...............................................................................................................
...............................................................................................................
...............................................................................................................
...............................................................................................................
...............................................................................................................
...............................................................................................................
...............................................................................................................
...............................................................................................................
...............................................................................................................
Bibliography
Adiloglu, K. and Alpaslan, F. N. (2007). A machine learning approach to two-voice
counterpoint composition. Knowledge-Based Systems, 20(3):300–309.
Ames, C. (1987). Automated composition in retrospect: 1956-1986. Leonardo, pages
169–185.
Bengio, Y., Simard, P., and Frasconi, P. (1994). Learning long-term dependencies with
gradient descent is difficult. IEEE transactions on neural networks, 5(2):157–166.
Bishop, C., Bishop, C. M., et al. (1995). Neural networks for pattern recognition.
Oxford university press.
Boden, M. A. (2009). Computer models of creativity. AI Magazine, 30(3):23.
Boenn, G., Brain, M., De Vos, M., et al. (2008). Automatic composition of melodic
and harmonic music by answer set programming. In International Conference on
Logic Programming, pages 160–174. Springer.
Boulanger-Lewandowski, N., Bengio, Y., and Vincent, P. (2012). Modeling tempo-
ral dependencies in high-dimensional sequences: Application to polyphonic music
generation and transcription. arXiv preprint arXiv:1206.6392.
Bowles, E. (1970). Musicke’s handmaiden: Or technology in the service of the arts.
The computer and music, pages 3–20.
Cambouropoulos, E., Kaliakatsos-Papakostas, M., and Tsougras, C. (2015). Structural
blending of harmonic spaces: a computational approach. In Proceedings of the 9th
Triennial Conference of the European Society for the Cognitive Science of Music
(ESCOM).
Cherla, S., Tran, S. N., Weyde, T., and Garcez, A. S. d. (2015). Hybrid long-and
short-term models of folk melodies. In ISMIR, pages 584–590.
Chollet, F. et al. (2015). Keras. https://keras.io.
CoInvent, P. (2017). Concept invention theory project. In Concept Invention Theory
Project Abstract. European Commission.
Conklin, D. (2003). Music generation from statistical models. In Proceedings of
the AISB 2003 Symposium on Artificial Intelligence and Creativity in the Arts and
Sciences, pages 30–35. Citeseer.
Cope, D. and Mayer, M. J. (1996). Experiments in musical intelligence, volume 12.
AR editions Madison, WI.
Cuthbert, M. S. and Ariza, C. (2010). music21: A toolkit for computer-aided musicol-
ogy and symbolic music data.
Eck, D. and Schmidhuber, J. (2002). A first look at music composition using lstm
recurrent neural networks. Istituto Dalle Molle Di Studi Sull Intelligenza Artificiale,
103.
Eppe, M., Confalonieri, R., Maclean, E., Kaliakatsos, M., Cambouropoulos, E., Schor-
lemmer, M., Codescu, M., and Kuhnberger, K. (2015). Computational invention of
cadences and chord progressions by conceptual chord-blending. AAAI Press; Inter-
national Joint Conferences on Artificial Intelligence.
Fauconnier, G. and Turner, M. (2008). The way we think: Conceptual blending and
the mind’s hidden complexities. Basic Books.
Goguen, J. A. (2006). Mathematical models of cognitive space and time. In Andler,
D. et al., editors, Reasoning and Cognition: Proceedings of the Interdisciplinary
Conference on Reasoning and Cognition, pages 125–148, Tokyo. Keio University
Press.
Hadjileontiadis, L. J. (2014). Conceptual blending in biomusic composition space:
The “brainswarm” paradigm. In ICMC.
Haigh, C. (2014). Exploring folk fiddle - an introduction to folk styles, technique and
impro. Schott & Co.
Herremans, D., Sorensen, K., and Conklin, D. (2014). Sampling the extrema from
statistical models of music with variable neighbourhood search.
Hild, H., Feulner, J., and Menzel, W. (1992). Harmonet: A neural net for harmoniz-
ing chorales in the style of js bach. In Advances in neural information processing
systems, pages 267–274.
Hiller Jr, L. A. and Isaacson, L. M. (1957). Musical composition with a high speed
digital computer. In Audio Engineering Society Convention 9. Audio Engineering
Society.
Johnson, D. D. (2017a). Generating polyphonic music using tied parallel networks. In
International Conference on Evolutionary and Biologically Inspired Music and Art,
pages 128–143. Springer.
Johnson, D. D. (2017b). Generating polyphonic music using tied parallel networks.
In Correia, J., Ciesielski, V., and Liapis, A., editors, Computational Intelligence
in Music, Sound, Art and Design, pages 128–143, Cham. Springer International
Publishing.
Kaliakatsos-Papakostas, M., Cambouropoulos, E., Kuhnberger, K.-U., Kutz, O., and
Smaill, A. (2014). Concept invention and music: creating novel harmonies via
conceptual blending. In In Proceedings of the 9th Conference on Interdisciplinary
Musicology (CIM2014), CIM2014. Citeseer.
Mao, H. H., Shin, T., and Cottrell, G. (2018). Deepj: Style-specific music generation.
In Semantic Computing (ICSC), 2018 IEEE 12th International Conference on, pages
377–382. IEEE.
Martin, A., Jin, C., van Schaik, A., and Martens, W. L. (2010). Partially observable
markov decision processes for interactive music systems. In Proceedings of the
International Computer Music Conference.
Miles Huber, D. (1991). The midi manual. USA. Howard W. Sams.
Moon, T., Choi, H., Lee, H., and Song, I. (2015). Rnndrop: A novel dropout for rnns
in asr. In Automatic Speech Recognition and Understanding (ASRU), 2015 IEEE
Workshop on, pages 65–70. IEEE.
Mozer, M. C. (1994). Neural network music composition by prediction: Exploring
the benefits of psychoacoustic constraints and multi-scale processing. Connection
Science, 6(2-3):247–280.
Mumford, M. D. (2003). Where have we been, where are we going? taking stock in
creativity research. Creativity research journal, 15(2-3):107–120.
Nierhaus, G. (2009). Algorithmic composition: paradigms of automated music gener-
ation. Springer Science & Business Media.
Olson, H. F. (1967). Music, physics and engineering, volume 1769. Courier Corpora-
tion.
Pachet, F. and Roy, P. (2011). Markov constraints: steerable generation of markov
sequences. Constraints, 16(2):148–172.
Ponsford, D., Wiggins, G., and Mellish, C. (1999). Statistical learning of harmonic
movement. Journal of New Music Research, 28(2):150–177.
Quick, D. (2016). Learning production probabilities for musical grammars. Journal of
New Music Research, 45(4):295–313.
Raczynski, S. A. and Vincent, E. (2014). Genre-based music language modeling with
latent hierarchical pitman-yor process allocation. IEEE/ACM Transactions on Au-
dio, Speech, and Language Processing, 22(3):672–681.
Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural Compu-
tation, 9(8):1735–1780.
Swift, A. (1997). A brief introduction to MIDI. http://www.doc.ic.ac.uk/~nd/
surprise_97/journal/vol1/aps2/.
Todd, P. M. and Werner, G. M. (1999). Frankensteinian methods for evolutionary
music. Musical networks: parallel distributed perception and performace, pages
313–340.
Turner, M. (2014). The origin of ideas: Blending, creativity, and the human spark.
Oxford University Press.
Zacharakis, A., Kaliakatsos-Papakostas, M., Tsougras, C., and Cambouropoulos, E.
(2017). Creating musical cadences via conceptual blending: Empirical evaluation
and enhancement of a formal model. Music Perception: An Interdisciplinary Jour-
nal, 35(2):211–234.
Zacharakis, A. I., Kaliakatsos-Papakostas, M. A., and Cambouropoulos, E. (2015).
Conceptual blending in music cadences: A formal model and subjective evaluation.
In ISMIR, pages 141–147.