Probabilistic Music Style
Modeling and a Computational
Approach for Conceptual
Blending in Music
Mohamed Nour Nader Abouelseoud -
s1716655
Master of Science
Artificial Intelligence
School of Informatics
University of Edinburgh
2018
Abstract
This dissertation describes the design and implementation of a deep learning ar-
chitecture for music style modeling. Two LSTM-based machine learning models are
trained to learn different styles, Bach and traditional folk music, and to generate new
pieces in their respective styles. We then propose a computational model for blending
these two models into a new conceptual blending model that generates music charac-
teristic of both styles. I conduct quantitative analysis of the outputs of the models and
show that they succeed in their respective tasks to varying extents. I also present par-
ticipant survey results showing general audience agreement that the generative music
models succeed in capturing the specified genres and that the blended model almost
perfectly blends the two, generating musical outputs that contain characteristics of
both constituent genres in equal measure.
Acknowledgements
I would like to thank my supervisor Alan Smaill for his advice and patience, Dr. Joe
Corneli for volunteering his time to give me constant feedback, and my friends and
colleagues for agreeing to participate in my study and intently listening to my musical
output. Credit also goes to MIT’s Michael Cuthbert and his Music21 library (2010),
with which all handling, pre-processing, post-processing and graphing of the MIDI
and music files was implemented.
Declaration
I declare that this thesis was composed by myself, that the work contained herein is
my own except where explicitly stated otherwise in the text, and that this work has not
been submitted for any other degree or professional qualification except as specified.
(Mohamed Nour Nader Abouelseoud - s1716655)
To my parents
Table of Contents
1 Introduction 1
1.1 Project Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
2 Background and Literature Review 4
2.1 What is Meant by Algorithmic Composition . . . . . . . . . . . . . . 4
2.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2.1 A Brief History of Algorithmic Composition . . . . . . . . . 5
2.2.2 Narrowing Down the Scope . . . . . . . . . . . . . . . . . . 7
2.3 Markovian and Other Statistical Models . . . . . . . . . . . . . . . . 7
2.3.1 Motivation and Initial Attempts at Using Markov Models . . . 7
2.3.2 Recent Success of Markov Models . . . . . . . . . . . . . . . 8
2.3.3 Shortcomings of Markov Models . . . . . . . . . . . . . . . 9
2.4 Artificial Neural Networks . . . . . . . . . . . . . . . . . . . . . . . 9
2.4.1 Motivation and Definition of ANNs . . . . . . . . . . . . . . 9
2.4.2 Success of ANNs in Modeling Musical Structure . . . . . . . 10
2.5 Conceptual Blending . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.5.1 On Conceptual Blending . . . . . . . . . . . . . . . . . . . . 11
2.5.2 Conceptual Blending in Music . . . . . . . . . . . . . . . . . 12
3 Methods 13
3.1 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.1.1 The Bach Model . . . . . . . . . . . . . . . . . . . . . . . . 13
3.1.2 The Traditional Folk Music Model . . . . . . . . . . . . . . . 14
3.2 Pre-Processing and Data Cleaning . . . . . . . . . . . . . . . . . . . 15
3.3 Data Representation . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.4 Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.4.1 Recurrent Neural Networks and LSTMs . . . . . . . . . . . . 18
3.4.2 Hyper-parameters . . . . . . . . . . . . . . . . . . . . . . . . 19
3.4.3 Sampling the Next Time-step . . . . . . . . . . . . . . . . . 20
3.4.4 Conceptual Blending . . . . . . . . . . . . . . . . . . . . . . 21
4 Experiments 24
4.1 Loss Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
4.2 Training the Models . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
4.3 Intermediate Stages and Experimentation . . . . . . . . . . . . . . . 27
5 Evaluation 29
5.1 On Evaluating Music Composition Models . . . . . . . . . . . . . . . 29
5.2 On Evaluating Conceptual Blending Models . . . . . . . . . . . . . . 30
5.3 Analyzing Musical Outputs Against Training Data . . . . . . . . . . 31
5.4 Empirical Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . 35
6 Conclusion 38
6.1 Context and Motivation . . . . . . . . . . . . . . . . . . . . . . . . . 38
6.2 Summary and Future Work . . . . . . . . . . . . . . . . . . . . . . . 39
A Participant survey 41
Bibliography 44
List of Figures
3.1 Triplet preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.2 Data irregularities . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.3 Different note representations . . . . . . . . . . . . . . . . . . . . . . 18
3.4 Bach model architecture . . . . . . . . . . . . . . . . . . . . . . . . 20
4.1 Model training losses against epochs . . . . . . . . . . . . . . . . . . 26
5.1 Pitch class frequency histograms . . . . . . . . . . . . . . . . . . . . 34
5.2 Pitches and note lengths weighted scatter diagrams . . . . . . . . . . 35
5.3 Participant rating of the style models’ accuracy showing notable spread
of opinions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
5.4 Participant rating of the blending model showing polarization of opinion 37
Chapter 1
Introduction
1.1 Project Overview
This report will discuss the design, development and implementation of an artificial
intelligence music composition system. The project is multifaceted and is split into
more than one part, tackling several problems within the realm of algorithmic com-
position: music style modeling, the relatively recent cognitive theory of conceptual
blending, how to apply that theory computationally, and how it pertains to computa-
tional creativity in general.
The first part of the project is to construct two different generative machine learning
models using artificial neural networks. Each model will learn from a training set in
a different musical style, and their outputs will be evaluated both objectively and sub-
jectively through user surveys. The two models will then be used in the second part of
the project.
The second part examines the cognitive theory of conceptual blending and its rela-
tion to computational creativity and generative artificial intelligence. Using the two
generative models implemented in the first part, different hypotheses for computation-
ally implementing conceptual blending will be demonstrated and examined, and the
effectiveness of each will be compared against the others.
1.2 Organization
The report will start with a preamble containing some commentary on how the term
“algorithmic composition” is used in the literature, in this report, and in relation to this
project. This is followed by a background introducing the different aspects of the
project and a literature review describing previous work in the fields of algorithmic
composition and conceptual blending.
This will be followed by a chapter describing the methodology and the different
approaches by which the models were built. The chapter also describes the key aspects
of the models: the training data sets, the digital representation of the musical data, the
artificial neural network architecture, and the hyper-parameters used. The section on
the data also discusses some anomalies present in the music pieces comprising the
datasets and the data cleaning process.
Building on the methodology, the next chapter describes the experiments and the
training of the models specified earlier, and comments briefly on the training process.
It also discusses the different iterations and design decisions in the evolution of the
final models, shedding light on why certain choices were discarded during the design
and implementation of the project. These iterations include different rules by which
the music is generated, different representations of the music, and other aspects that
changed, sometimes drastically, over the course of the project.
The decision to discuss these intermediate models, and why they were found un-
satisfactory, in the experiments chapter rather than in the methodology was not arbi-
trary. It did not seem fitting to include them in the methodology chapter, as they do
not directly relate to the final implementation and methods that were ultimately scru-
tinized and evaluated. It remained important to mention them, however, and a section
in the experiments chapter seemed most suitable: trying out these different models
and design decisions falls under the category of ‘experimenting’ for the best output of
the model, and they contributed greatly, through trial and error and the consequent
evolution of the model, to the final outcome.
The methodology and experiment details will be followed by an evaluation chapter.
The evaluation of the different models will be done by analyzing the pieces musically
and examining the properties of the different styles, as well as through a subjective
evaluation. The subjective evaluation will consist of user surveys in which participants
listen to the different pieces and rate how successfully each models the given style or,
in the case of the conceptual blending model, how well it blends the two styles into
one novel musical product.
The report will be concluded by a final word giving a better idea of the context
in which this research falls and where the idea originated, then a brief reiteration of
the results to sum up the paper, followed by a future work section suggesting possible
directions for further research.
Chapter 2
Background and Literature Review
2.1 What is Meant by Algorithmic Composition
By way of a much-needed preamble, and before introducing some of the history and
previous work on algorithmic composition, a word needs to be said on the different
uses of the term ‘algorithmic composition’ in the literature and how it will be used in
this paper.
Nierhaus (2009), in his comprehensive book on algorithmic composition, makes
the distinction between “algorithmic composition as a genuine method of composi-
tion,” in which the complete overall structure of the music is determined exclusively
by the algorithms or system in place, and “style imitation,” in which “musical material
is generated according to a given style,” mostly as a means of music analysis and veri-
fication by resynthesis. The latter, as Nierhaus describes it, is similar to the way gram-
mars are used in natural language processing to parse sentences and verify their lin-
guistic soundness by resynthesis using rules and parse trees.
Nierhaus also comments on this vagueness and on the gray areas of the definitions,
and on the unclear line that separates genuine composition from style imitation (as he
defines it). It is consequently often hard to place a given algorithmic composition sys-
tem into one of these two categories, and the system proposed in this paper is no ex-
ception. I would argue, however, that since the machine learning system implemented
and described here generates novel compositions completely autonomously, with all
aspects of the compositions determined exclusively by the system, it falls under the
first definition of the term, despite the fact that it does ‘imitate’ the style on which it
was trained.
The fact that this system, like almost every algorithmic composition system in exis-
tence, even completely autonomous ones, has to derive its rules from a body of works
(which will, inevitably, have a certain style) also sheds light on the problematic
choice of words in the definitions. A music composition system, unless explicitly
designed to generate completely novel pieces of music in a completely new, never-
before-seen ‘style’ or genre (a very strange, difficult goal), will necessarily imitate a
certain style of music, causing confusion when attempting to place the system under
one of the two definitions mentioned here.
It should also be noted that a third, arguably erroneous, use of the term can be
gleaned from some of the literature by music and arts researchers, where “algorithmic
composition” has been used to refer to programs designed to aid composers with as-
pects of the composition process, such as visualizing music or assisting with score
writing. The more suitable term for this is “computer aided composition,” which is
unrelated to the topic at hand.
The following sections provide a historical background and literature review of the
field and of previous work in algorithmic composition, some of which was presented
in the project proposal (2018) but needs to be reiterated for emphasis and clarity, and
to put the project in perspective relative to the field.
2.2 Background
2.2.1 A Brief History of Algorithmic Composition
Since the first, most primitive occurrences of automation, the problem of algorithmic
composition, or using computers to “compose elaborate and scientific pieces of mu-
sic of any degree of complexity or extent” (Bowles, 1970), has been raised. Ada
Lovelace, while working with Charles Babbage on the difference and analytical en-
gines around 1840, hinted at the problem (Herremans et al., 2014). Despite initial
optimism and excitement, however, expectations were soon tempered: the problem
proved far more challenging than first assumed, with computers unable to generate
even simple melodies anywhere near acceptable to humans (Conklin, 2003).
Music composition, like creative tasks in general, is complex and elusive, and to
this day not entirely understood. It consists of “accumulated individual experiences,
cultural contexts, and inherited predilections swirling about within the composer” to
create a work of music (Todd and Werner, 1999), which is not an easy challenge to
approach even with the most advanced artificial intelligence. Over the years, many
machine learning and probabilistic models have been devised to model certain music
styles statistically and achieve results as realistic as possible. Solving this task means
not only solving the titular problem of algorithmic composition, in the sense of gen-
erating an aesthetically pleasing output that successfully captures a certain musical
style while satisfying music theory constraints; a satisfactory solution would also
have many consequent applications useful for music information retrieval systems,
such as identifying composers, automatic music transcription, finding a suitable har-
mony for a given melody, music segmentation and pitch estimation (Raczynski and
Vincent, 2014).
The “musical dice game” attributed to Mozart, though with no proper authentication
of this attribution (Cope and Mayer, 1996), seems to be the earliest described system
of algorithmic composition using a probabilistic model. In the game, the player rolls
two dice and, based on the outcome, chooses 16 measures from a set of measures pre-
composed by the composer; then, based on the roll of a third die, picks another 16
measures from a different set of bars to form the trio to the minuet (Boenn et al.,
2008). Although it does use a statistical model and the numbers from the dice rolls to
randomly ‘generate’ music, the creative work remains that of the composer who pro-
vided the measures, and not of the statistical model.
Following this, and despite the initial interest evidenced by the previous example
and Lovelace’s mid-nineteenth-century remark, the field of algorithmic or automated
composition went through a long hiatus. This was a natural consequence of the fact
that the technology needed to come anywhere close to this ambitious vision of music
generation and analysis by an autonomous machine would not appear for decades.
With the introduction of digital computers in the mid-twentieth century and the suffi-
cient advancement of technology, however, the area was immediately revived and
brought back to the scientific community’s eye, most notably by Hiller Jr and Isaacson
(1957) in their 1957 composition titled the Illiac Suite, in which a Markov chain
model paired with a music composition rule system was employed. Iannis Xenakis is
another notable early contributor who famously used computers and stochastic meth-
ods to generate music, some of which also used Markov models (Ames, 1987). Some
of these methods will be elaborated on later in this section.
2.2.2 Narrowing Down the Scope
While the idea of generating new music using a set of rules and generation probabilities
is not a new development nor is it a whim of the artificial intelligence research commu-
nity, it has undeniably flourished and resurfaced to prominence in the past few decades
due to lower computing costs and the consequent advances in AI research. Many AI
systems, with varying levels of complexity and success were devised to tackle the
problem of algorithmic composition as well as other subproblems of it.
These methods include grammars, rule-based systems, constraint satisfaction mod-
els, Markov chains and statistical models, evolutionary algorithms, cellular automata,
artificial neural networks and other methods from the field. This survey, however,
covers only the machine learning methods used for algorithmic composition, namely
Markov models and artificial neural networks. More exhaustive surveys covering
these topics, and more, are available, such as Fernandez and Vico’s (2013) survey of
the field.
2.3 Markovian and Other Statistical Models
2.3.1 Motivation and Initial Attempts at Using Markov Models
Boenn et al. (2008) summarize the problem of creativity as it pertains to musical com-
position briefly as “where is the next note coming from?” To answer this question,
among the first intuitions was to apply techniques developed for natural language
processing (Nierhaus, 2009) to create generative music models, since both are essen-
tially sequence generation problems. From here began the experimentation with
Markov chains and other Markovian model variations for generating music.
Harry Olson was among the first to experiment with Markov models in musical
applications (Nierhaus, 2009). Olson analyzed 11 pieces, standardized to the key of
D, and produced first- and second-order Markov models for pitch and rhythm (Olson,
1967). Upon examining the results, Olson asserted that the second-order model pro-
duced better melodies and better resolutions in the generated music.
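Olson’s first-order approach can be sketched as counting pitch-to-pitch transitions in a corpus and sampling from the resulting distributions. The toy corpus below is invented for illustration and is not Olson’s data:

```python
import random
from collections import defaultdict

def train_first_order(melodies):
    """Count pitch-to-pitch transitions and normalize to probabilities."""
    counts = defaultdict(lambda: defaultdict(int))
    for melody in melodies:
        for prev, nxt in zip(melody, melody[1:]):
            counts[prev][nxt] += 1
    return {p: {n: c / sum(f.values()) for n, c in f.items()}
            for p, f in counts.items()}

def generate(model, start, length, rng=random):
    """Sample a melody by repeatedly drawing the next pitch."""
    melody = [start]
    for _ in range(length - 1):
        dist = model.get(melody[-1])
        if not dist:  # dead end: pitch never observed with a successor
            break
        pitches, probs = zip(*dist.items())
        melody.append(rng.choices(pitches, weights=probs)[0])
    return melody

# Invented two-melody corpus in D major, for illustration only.
corpus = [["D", "E", "F#", "G", "A", "G", "F#", "E", "D"],
          ["A", "G", "F#", "E", "D", "E", "F#", "D"]]
model = train_first_order(corpus)
tune = generate(model, "D", 8)
```

A second-order model is the same idea with states that are pairs of consecutive pitches, which is what Olson found produced the better melodies.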
Hiller Jr and Isaacson (1957) experimented with algorithmic composition to gen-
erate the Illiac Suite, in which the fourth movement, or “experiment,” as they referred
to it, used variable-order Markov chains to model different aspects of the music, such
as skips and sound textures, to generate the notes.
Along with the aforementioned efforts, another pioneer in computer aided compo-
sition, Iannis Xenakis, was concurrently working with Markov models to generate
music (Nierhaus, 2009). He used Markov chains to generate a transition probability
table of different “screens.” The screens encoded the different dynamics and instru-
mentations to be used at different points in the final music composition.
2.3.2 Recent Success of Markov Models
Among the more recent experiments in music generation using Markov models is that
of Ponsford et al. (1999), who used a corpus of 84 seventeenth-century sarabandes (a
simple dance form in triple time) to generate acceptable new pieces. The experiment
included a preprocessing step to annotate the music with its explicit harmonic struc-
ture, and a post-processing step paired with a set of grammar rules to ensure accept-
able results. It is worth mentioning that the preprocessing step, in which the pieces are
annotated with their harmonies, is almost perfectly analogous to part-of-speech tag-
ging in natural language processing.
Ponsford et al.’s system achieved relatively successful results, with most pieces
starting and ending in the same key, as well as resolving with a perfect or imperfect
cadence, a characteristic shared by all of the corpus pieces used for training. It should
be noted that the evaluation methods used by Ponsford et al. are completely subjective,
depending on human perception and the evaluators’ judgment as to whether the pieces
resemble the training corpus. This is common in generative models of music and in
creative AI systems dealing with similarly subjective tasks. This evaluation problem
will be elaborated on later and addressed in our proposed system.
Another successful implementation was that of Pachet and Roy (2011), who built a
mixed model combining Markov processes with a constraint system. The set of con-
straints addressed an interesting problem: it ensured that, even when the constraints
are applied, the underlying probabilities of the Markov model are still satisfied and
reflected in the results, a quality that models up to that point could not uphold, as they
would usually violate the Markov property upon applying constraints.
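To see why this is nontrivial, consider the naive alternative: sample unconstrained walks from the Markov model and reject those that violate the constraint. This recovers the correct conditional distribution but can require a huge number of tries as sequences grow, while greedy in-place enforcement distorts the probabilities; Pachet and Roy’s contribution is to compile the constraints into a new Markov model that is both distribution-preserving and efficient. A sketch of the naive rejection approach, with an invented transition table over note names (not their technique or data):

```python
import random

# Toy transition probabilities, assumed for illustration only.
P = {
    "C": {"D": 0.5, "E": 0.3, "G": 0.2},
    "D": {"E": 0.6, "C": 0.4},
    "E": {"F": 0.5, "D": 0.3, "C": 0.2},
    "F": {"G": 0.6, "E": 0.4},
    "G": {"C": 0.7, "F": 0.3},
}

def sample_walk(start, length, rng=random):
    """Unconstrained random walk through the Markov chain."""
    seq = [start]
    for _ in range(length - 1):
        notes, probs = zip(*P[seq[-1]].items())
        seq.append(rng.choices(notes, weights=probs)[0])
    return seq

def constrained_sample(start, length, constraint, max_tries=10000, rng=random):
    """Rejection sampling: resample whole walks until one satisfies the
    constraint. Correct but potentially very wasteful, which is exactly
    the inefficiency Pachet and Roy's compilation approach avoids."""
    for _ in range(max_tries):
        seq = sample_walk(start, length, rng)
        if constraint(seq):
            return seq
    return None

# Constraint: the melody must end back on the tonic C.
melody = constrained_sample("C", 8, lambda s: s[-1] == "C")
```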
While the model does generate subjectively agreeable variations, continuations and
answers to given melodies, the paper contains little examination of the output, per-
haps on the assumption that, since it satisfies the constraints, the system at least gen-
erates minimally acceptable results. This lack of evaluation nevertheless remains a
significant shortcoming, especially in a machine learning paper dealing with compu-
tational creativity, where the evaluation, examination and fitness of the output is of
utmost importance.
2.3.3 Shortcomings of Markov Models
While Markov models can achieve favorable results, as shown, and have proven useful
as a tool for probabilistic music style modeling, they have an inherent inability
to capture important temporal features and long-term dependencies present in music,
such as motif repetition and stylistic variations on a theme (Quick, 2016). To tackle
this problem, there has been experimentation with variations on Markov models as
well as combinations of them with other, non-Markovian models.
One notable attempt is the system devised by Martin et al. (2010), which employs
Partially Observable Markov Decision Processes (POMDPs). A POMDP is a general-
ization of the hidden Markov model that takes actions and interactions between the
agent and the environment into consideration when choosing the next state; here it is
used to create a system that improvises alongside a musician. Martin et al. used a
simple evaluation function based on whether the generated improvisation is in the
expected key, and achieved good results. This was an interesting effort, since POMDPs
had not previously been used in this context of algorithmic composition or interactive
music generation. The experiment also marked a certain move outside conventional
Markovian language modeling of music and into other realms. One such realm of
machine learning that has proven very successful utilizes Artificial Neural Networks
(ANNs). These will be discussed in further detail in the next section and will be the
main focus of the project.
2.4 Artificial Neural Networks
2.4.1 Motivation and Definition of ANNs
ANNs are a biologically inspired machine learning model that allows for problem solv-
ing by manipulating and changing the structure of internally connected components
(Nierhaus, 2009) and the weighted sums and connections between them. As
Nierhaus points out, employing neural networks for sequence generation to compose
music allows for surprising models that follow the underlying qualities of the corpus
and, unlike generative grammars or Markov models, do not blindly reproduce transi-
tions observed in the training corpus. This capacity for detecting subtleties has led to
many attempts, some notably successful, to tackle the problem of algorithmic compo-
sition using ANNs.
2.4.2 Success of ANNs in Modeling Musical Structure
One of the first systems to seriously employ artificial neural networks for music com-
position is HARMONET, developed by Hild et al. (1992), which aimed to generate
four-part harmonies for J. S. Bach’s chorale melodies in his style, a popular composi-
tion exercise for music students. HARMONET uses recurrent neural networks
(RNNs), in which the output of one time-step is fed back as an input to the next, to
achieve the task. The network has 106 input units representing the current harmonic
and melodic contexts, a boolean indicating whether the current note falls on a stressed
beat, and the position of the current harmony relative to the beginning or end of the
musical phrase; it was trained on two sets of twenty Bach chorales in major and minor
keys. The output of the recurrent network, denoted in the paper as Ht, represents the
generated harmony at the current time-step, or beat, and is fed back into the network
as the current harmonic context, used along with the other inputs to generate Ht+1.
This is paired with another, simpler neural network, not discussed elaborately in the
original paper, which is responsible for adding ornamentation to the output of the
RNN.
Harmonies generated by HARMONET were judged by “an audience of musical
professionals” to be on the level of a professional musician and suitable for musical
practice (Hild et al., 1992). It therefore bears significant importance for the field as
one of the first experiments using ANNs, and more specifically RNNs, to yield such
successful results.
Recurrent neural networks will be introduced in further detail later, as our music
composition model uses a Long Short-Term Memory (LSTM) network, a type of
recurrent neural network with certain features desirable for the music composition task.
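As a preview of the mechanics discussed later, a single LSTM cell step can be written out directly from its gating equations. The sketch below uses scalar states and arbitrary toy weights (chosen for illustration, not learned values) purely to show how the forget, input and output gates regulate the cell’s long-term memory:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, W):
    """One LSTM time-step with scalar input and state. W maps each gate
    name to (input weight, recurrent weight, bias); toy values only."""
    i = sigmoid(W["i"][0] * x + W["i"][1] * h_prev + W["i"][2])    # input gate
    f = sigmoid(W["f"][0] * x + W["f"][1] * h_prev + W["f"][2])    # forget gate
    o = sigmoid(W["o"][0] * x + W["o"][1] * h_prev + W["o"][2])    # output gate
    g = math.tanh(W["g"][0] * x + W["g"][1] * h_prev + W["g"][2])  # candidate
    c = f * c_prev + i * g   # cell state: gated long-term memory
    h = o * math.tanh(c)     # hidden state: the step's output
    return h, c

# Arbitrary illustrative weights, identical for every gate.
W = {k: (0.5, 0.25, 0.0) for k in "ifog"}
h, c = 0.0, 0.0
for x in [1.0, 0.0, 1.0]:  # a tiny input sequence
    h, c = lstm_step(x, h, c, W)
```

The explicit cell state c, carried forward and only multiplicatively gated, is what lets LSTMs retain the long-term dependencies that plain Markov models miss.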
Adiloglu and Alpaslan (2007) also achieved impressive results with a relatively
simple neural network model. Their “NeuroComposer” system used a simple feed-
forward neural network to generate a first-species counterpoint to a given melody.
What is interesting in the model is the input representation of the music, since repre-
sentation in these models is very important and instrumental to the performance of the
network. NeuroComposer’s representation encoded the pitch height, the location on
the circle of fifths and the location on the chromatic circle. The model was also sub-
jectively evaluated: one evaluation showed that the model’s output generally obeys
the rules of counterpoint, while another, by three professional musicians, found that
the pieces do not consistently obey the rules but that the output is generally correct.
2.5 Conceptual Blending
2.5.1 On Conceptual Blending
Creativity, among many competing definitions, is defined as the process or phenomenon
by which we produce a useful product that is previously unseen (Mumford, 2003).
Boden (2009) divides creativity into three forms by which this novel outcome can be
generated: exploratory creativity, in which new regions of a conceptual space are ex-
amined; transformational creativity, in which established concepts are manipulated in
original, previously unseen ways; and combinational creativity, in which conceptual
spaces that are seemingly not directly related are linked and associations made be-
tween their elements. Boden also asserts that combinational creativity is the most dif-
ficult to describe formally.
Directly related to this is Turner and Fauconnier’s (2008; 2014) cognitive theory of
conceptual blending, in which they describe the merging, or “blending,” of elements or
aspects of two conceptual spaces to create a novel concept which combines aspects of
both of the spaces from which it is blended. This new blended space, besides retaining
some of the qualities of the original spaces, possesses novel properties that are inter-
pretable in a way that allows for better understanding of the constituent spaces or the
emergence of a completely new idea or concept (Kaliakatsos-Papakostas et al., 2014).
Hadjileontiadis (2014) asserts that, despite the relative simplicity of generating
novel products by blending old ones, the difficulty lies in applying this combinational
creativity, or blending of concepts, in a “computationally tractable way, and in being
able to recognize the value of newly invented concepts for better understanding a cer-
tain domain; even without it being specifically sought, i.e., by ‘serendipity.’” This dif-
ficulty makes it challenging to construct a ‘universal’ or generic conceptual blending
model that can be successfully applied to any domain. A further difficulty, arising
from this and from the sometimes vague nature of what a suitable blend between two
spaces might be, is the evaluation of the blending model’s outcome. This will be
elaborated on further in the evaluation section.
Further on conceptual blending, Goguen (2006), unlike preceding proposals, pro-
vides an explicit computational account of blending, along with an interesting mathe-
matical framework for representing space and time that supports conceptual blending
in several “metaphorical” ways. He explores four examples showing different kinds
of ambiguity in the process of conceptual blending: geometric conceptual spaces, a
puzzle that incorporates conceptual blending in its solution, the famous example of
the different possible blends of “boat” and “house” and their implied conceptual
spaces, and temporal reasoning using spatial analogy. The framework provides im-
portant insight into conceptual blending and the related cognitive process.
2.5.2 Conceptual Blending in Music
Although conceptual blending is a relatively new theory, there has been, since its be-
ginning, a lot of interest in it as it relates to computational creativity, especially in
music. Kaliakatsos-Papakostas et al. (2014) tackled the problem of conceptual blend-
ing of harmonic spaces and achieved very satisfactory results. They attempted con-
ceptual blending at the chord level, the harmonic progression level and the harmonic
structure level (at which they created a “blendoid” of two harmonic systems, namely
the symmetric harmonic space and the diatonic space). They also briefly comment on
“meta-level” harmonic blending, in which more metaphorical connections between
different domains are examined, discussing it as it pertains to the emotions and moods
conveyed by the chords.
Similarly, Cambouropoulos et al. (2015) further address harmonic space blending,
again at the chord, sequence or progression, and harmonic structure levels. In addi-
tion, the paper tackles scale-level blending, in which intervals from two different
scales are blended to give rise to a new scale or mode, something that happens fre-
quently in jazz. They also give examples of melody-harmony blending and provide
very interesting demonstrations showing The Beatles’ song Michelle and a Bach
chorale melody harmonized in different styles.
Moreover, Eppe et al. (2015) and Zacharakis et al. (2015) use a similar approach of
chord blending to invent novel, blended cadences, often giving rise to very interesting
substitution or transition chords. They also describe their evaluation method, which
will be commented on later in the evaluation section.
Chapter 3
Methods
In this chapter, the datasets used for training the models, their features, the preprocess-
ing and cleaning undertaken before use, and the digital representation of the data and
how it fits into a machine learning framework will be described. The model architec-
tures used will also be discussed and, for the conceptual blending part of the project,
the blending hypotheses for the models will be described.
3.1 Data
3.1.1 The Bach Model
For the first generative model, the chosen style, or genre, was Baroque music, more
specifically the style of Bach, the Baroque period's most prominent composer. For
this purpose, it was decided that the first of his famous suites for solo cello would be
used for training the machine learning model.
The first suite was chosen because it has the following desirable qualities that
would facilitate training and fit the model used, which will be described in a later
section of this chapter:
• It consists of six pieces, all of which are in the key of G major. This uniformity is
crucial, as the model has no way of explicitly distinguishing the key of a piece.
As a result, unless the system is augmented with elaborate rules or a different
representation defining the concept of a harmonic key, the model is likely to be
confused by the different sets of notes in different pieces steering the training
in different directions and, consequently, might end up generating pieces with
random accidental notes that do not fit any single key and sound dissonant and
unpleasant.
• The pieces are rhythmically simple and contain no complex durations or beat
divisions, a characteristic beneficial for the digital representation of music that
will be used in the model, discussed later.
• The cello suites are all solo pieces and, therefore, monophonic. This means that
only one line is written for a single instrument at any given time. There are a few
exceptions in which the cellist is expected to play a chord or a bass note with the
solo line, but these are very few and were discarded in the data cleaning process.
It should be noted that the representation used allows for polyphonic music to be
learned and generated, but since the models discussed are concerned only with
melody, which is a solo line, harmony and polyphony will not be addressed here.
• Widely regarded as among his greatest and most famous works, the cello suites
are Bach's canonical pieces that best capture his style and the Baroque genre in
general, while still being monophonic, as previously discussed. As a conse-
quence of their fame, the suites are readily available in MIDI format and in the
public domain, making them feasible to use for the purposes of this project.
3.1.2 The Traditional Folk Music Model
The style to be modeled by the second model was chosen to be traditional folk music
reels. The reel is a folk dance originating in Scotland that is also performed in the
British Isles and North America (Haigh, 2014).
For this purpose we will be using pieces extracted from the Nottingham Database1,
a database containing over 1000 British, American and Irish folk pieces, including
jigs, reels, hornpipes and waltzes. The first eighty-one pieces were chosen from the
reels in the database. These were chosen for reasons similar to the ones mentioned earlier
in the previous section.
The pieces, however, were not as uniform as the Bach cello suites and required
some preprocessing and cleaning, and the elimination of some pieces from the training
set. These steps are discussed in the following section.
1 https://ifdo.ca/~seymour/nottingham/nottingham.html
3.2 Pre-Processing and Data Cleaning
The music acquired was in MIDI format, a file format specifically designed for the
storage and transfer of music which makes it easy to modify and manipulate music
(Swift, 1997). A MIDI file contains messages, which carry events including pitch,
velocity and many other aspects of musical events as well as events describing music
notation like key and tempo (Miles Huber, 1991). How this format was used and the
relevant information extracted from the events to be converted into the representation
used in the model will be discussed in a later section dedicated to the topic.
The Bach pieces contained minimal irregularities since, as previously discussed,
they were uniform for the most part, with the exception of a few cases in which there
was a bass note to be played. The MIDI files provided2, however, already included
the occasional extra bass notes in a separate part (since MIDI also describes music
notation, it includes different parts to be written on separate staves), which made it
easy to extract only the part containing the main solo melody from the MIDI file. The
reels were not as uniform and therefore required more preprocessing and cleaning.
The folk reels contained many irregularities, which would disrupt the training pro-
cess and confuse the model, as explained earlier. Some of the irregularities could easily
be fixed by preprocessing with minimal change to the true piece, while others were
highly disruptive and could not be changed without drastically altering the piece. In
the case of the latter, it was decided to discard the pieces containing them, as there
were very few such pieces and the undesirable qualities were not characteristic of the
overall folk style we are trying to model.
The fixable irregularities consisted of two things, listed below:
• The pieces were in different keys, which is completely normal and expected
since, unlike the Bach pieces, they were not composed as an associated collec-
tion, but were composed over many years in many different geographic localities
and merely compiled in this dataset. This was fixed by transposing all the pieces
to the key of G major or, in the few cases where a piece was in a minor key, to
E minor, the relative minor of G major (which contains the same set of notes).
• Occasional irregularities in rhythm, such as a triplet (a beat divided
into three notes of equal duration). In this case, the triplet was converted into an
2 Retrieved from http://www.jsbach.net/midi/
eighth note followed by two sixteenth notes, as shown in Figure 3.1. Since a
triplet is an irregular rhythm that cannot be represented exactly (1/3 of a beat
is an infinitely repeating decimal, 0.333...), it cannot be expressed in the chosen
representation, in which equal division of the beats is crucial.
(a) Before ⇒ (b) After
Figure 3.1: Triplet preprocessing
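The two fixable preprocessing steps can be sketched in a few lines. This is an illustration only: the actual pipeline was implemented with Music21, so the helper names, the `(MIDI pitch, duration in beats)` note encoding and the tonic input are all assumptions made for the sketch.

```python
# Sketch of the two "fixable" preprocessing steps: transposition to
# G major and triplet quantization. Notes are modeled as
# (midi_pitch, duration_in_beats) pairs; names are illustrative.

G_TONIC = 67  # MIDI number for G4

def transpose_to_g(notes, tonic):
    """Shift every pitch by the interval from the piece's tonic to G,
    choosing the smaller direction (at most a tritone either way)."""
    shift = (G_TONIC - tonic) % 12
    if shift > 6:
        shift -= 12
    return [(pitch + shift, dur) for pitch, dur in notes]

def quantize_triplet(triplet):
    """Replace an eighth-note triplet (three 1/3-beat notes) with an
    eighth note (1/2 beat) followed by two sixteenths (1/4 beat each),
    keeping the total duration of one beat."""
    (p1, _), (p2, _), (p3, _) = triplet
    return [(p1, 0.5), (p2, 0.25), (p3, 0.25)]
```

Minor keys would be handled the same way with E (the relative minor tonic) as the target.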
The irregularities that were not fixable without crucially changing the piece oc-
curred in five pieces, which we chose to eliminate, bringing the total size of the train-
ing set from the initial eighty-one pieces down to seventy-six. The irregularities, along
with some examples, were:
• Polyphony. Even though they are supposed to be monophonic folk melodies, a
few pieces contained parallel lines at points, which were undesirable for the rea-
sons discussed in the previous section pertaining to the Bach suites. It was de-
cided to discard these pieces, as extracting one of the two lines would not be as
easy as in the Bach pieces and their removal would not have a significant effect
on the training.
• Irregular beats and note durations. Pieces containing strange beats and note
durations that could not be resolved as easily as the eighth-note triplet case were
discarded.
• Changes in time signature. One of the pieces contained a sudden, very unusual
time signature change to an odd signature that lasted for one bar, which also con-
tained an uneven beat that could not be captured in our representation.
• Changes in key signature. One of the pieces contained frequent changes in the
key signature back and forth throughout the piece before finally resolving back
to the original key. This would make it infeasible to restrict the piece to one key
even by transposition, as the two different harmonies would persist in the
intervals anyway. As this is certainly not a quality that features prominently in
this kind of music, this piece was removed.
(a) Polyphony; (b) odd durations and beats; (c) key signature change midway through
the piece; (d) time signature change
Figure 3.2: Data irregularities
3.3 Data Representation
The representation chosen for the model is a variation of piano-roll notation. Piano-
roll notation attempts to densely represent the MIDI digital score format (Mao et al.,
2018) by encoding the musical piece as a binary matrix in which the columns represent
the time steps and the rows represent the allowed note range.
The chosen resolution, or smallest possible time step, which fits both musical styles
(with the exception of the aforementioned anomalies), was a sixteenth note, which
represents a quarter of a beat. For example, for one standard 4/4 bar in a system that
allows 4 different notes, the matrix would have dimensions 4×16.
A problem that arises from this representation is that it is inherently ambiguous. A
note sustained for a certain duration is different from a note repeatedly played in each
of the time steps for that same duration, but both would be represented the same way
in this matrix. While other papers that used this representation resolved this problem
using an extra ‘replay’ matrix (Johnson, 2017a; Mao et al., 2018), due to the simpler,
monophonic nature of the melodies being learned and generated by this model, it was
decided to go with a simpler method of augmenting the matrix with a ‘sustain’ row,
in which a bit is switched on when the last note is sustained and off otherwise
(either when another note is being played or when there is a rest, indicating a time
step of silence). Figure 3.3 shows the difference between a long note, a repeatedly
played note and a short note played at the beginning of the beat followed by silence,
in a range of four allowed notes. The topmost row indicates the sustain bit.
(a) Sustained quarter note:

M = [ 0 1 1 1 ]
    [ 0 0 0 0 ]
    [ 1 0 0 0 ]
    [ 0 0 0 0 ]
    [ 0 0 0 0 ]

(b) Four repeated notes:

M = [ 0 0 0 0 ]
    [ 0 0 0 0 ]
    [ 1 1 1 1 ]
    [ 0 0 0 0 ]
    [ 0 0 0 0 ]

(c) One sixteenth note:

M = [ 0 0 0 0 ]
    [ 0 0 0 0 ]
    [ 1 0 0 0 ]
    [ 0 0 0 0 ]
    [ 0 0 0 0 ]
Figure 3.3: Different note representations
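The three cases of Figure 3.3 can be written out directly as matrices, for example in NumPy. This is a sketch: the real note range of the trained models spans far more than four rows.

```python
import numpy as np

# Row 0 is the sustain row; rows 1-4 are the four allowed notes;
# columns are the four sixteenth-note time steps of one beat.

sustained = np.array([[0, 1, 1, 1],   # sustain on after the attack
                      [0, 0, 0, 0],
                      [1, 0, 0, 0],   # single attack at step 0
                      [0, 0, 0, 0],
                      [0, 0, 0, 0]])

repeated = np.array([[0, 0, 0, 0],    # sustain off: four separate attacks
                     [0, 0, 0, 0],
                     [1, 1, 1, 1],
                     [0, 0, 0, 0],
                     [0, 0, 0, 0]])

short = np.array([[0, 0, 0, 0],       # one sixteenth note, then silence
                  [0, 0, 0, 0],
                  [1, 0, 0, 0],
                  [0, 0, 0, 0],
                  [0, 0, 0, 0]])
```

Without the sustain row, cases (a) and (b) would collapse to the same matrix, which is exactly the ambiguity the extra row resolves.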
3.4 Models
3.4.1 Recurrent Neural Networks and LSTMs
For the machine learning model used, a conventional dense-layer neural network archi-
tecture does not suffice, since music is a sequence of interconnected notes and recur-
ring themes that cannot be modeled as a static input fed into a fully connected neural
network for the corresponding output to be generated. This problem of modeling
sequences has been studied in depth in the field of machine learning, since many
things are sequential in nature, such as speech, motion or music (Boulanger-
Lewandowski et al., 2012). Music is also particularly complex to model due to the
long-term dependencies that occur in its sequences, which manifest in repeating
harmonic progressions, recapitulations of melodic motifs and the many repeats
typically involved in a piece of music.
Recurrent neural networks (RNNs) can theoretically solve this problem by keeping
track of the long-term history that occurred up to the most recent point in the sequence
and taking it into account during gradient back-propagation, incorporating this data
into the training process in which the network learns from data. This, however, is
problematic due to what is referred to as the vanishing gradient effect. Despite the
RNN's ability, in principle, to summarize an entire sequence history, Bengio et al.
(1994) showed that this is unachievable in practice using gradient-descent methods,
the standard optimization approach for machine learning algorithms, and, conse-
quently, not all of the sequence history is effectively incorporated into the process of
tuning the model parameters to successfully capture the given style or sequence.
To remedy this problem, many approaches have been suggested, most of which
are succinctly surveyed and summarized in Hochreiter and Schmidhuber's 1997 paper
‘Long Short-Term Memory,’ in which they introduce the titular variation on the con-
ventional RNN, which proved immensely successful at solving the problem. LSTMs
make use of memory cells and gated units instead of the conventional hidden units
used in normal RNNs. They have proven very successful at learning long-term
dependencies and repeated sequences, and were demonstrated on the task of musical
composition, successfully learning a blues structure and abiding by it (Eck and
Schmidhuber, 2002). Since then, LSTMs and variations on them have been used
almost exclusively in the field of algorithmic music composition using neural
networks, to great success and with interesting results. For these reasons, this project
uses the LSTM as the main functioning layer in the proposed deep learning model.
3.4.2 Hyper-parameters
The models used in this project for modeling Bach and folk music are identical
except for the input dimension, which differs slightly since the range of notes present
in the Bach pieces is slightly smaller than that of the folk reels. The hyper-parameters
of the models were chosen using conventional manual search and fine-tuning based
on the observed musical output, since the nature of the task and the training set does
not allow for a validation set to be used for determining hyper-parameters. The
hyper-parameters are listed below, along with the values chosen.
• Number of layers. In this model, there are six layers stacked in the following
order: input LSTM layer → dropout → LSTM → dropout → dense, or fully
connected, layer→ softmax activation layer.
• Number of hidden units. The first LSTM layer has 128 hidden units, while the
second has 64. These values, which are also relatively popular in the literature,
seemed to capture the dependencies best during training and generate the best
musical output.
• Dropout rate. For both dropout layers, the dropout value chosen is 0.5, as used
in Johnson (2017b) and Moon et al. (2015).
• Optimizer and its parameters. The training was optimized using RMSProp with
a learning rate of 0.001 and Nesterov momentum of 0.9, as used in Johnson
(2017b). These are the conventional default values for this algorithm and were
not changed.
• Batch size and number of epochs. 400 epochs of training with a batch size of 128
were used. The epoch number is deliberately overestimated, which has no impact
since early stopping was used. This will be elaborated on in the next chapter,
which describes the experiments.
Figure 3.4: Bach model architecture
3.4.3 Sampling the Next Time-step
During training, the final softmax layer learns the probability distribution of the next
note based on the previously observed sequence and, during prediction, transforms the
output of the final dense layer into a probability distribution P_nextnote using the softmax
function, which is a normalized exponential function (Bishop et al., 1995). For dense
layer output φ and k possible notes, the output of the softmax activation layer for note
i would be:
P_{nextnote=i} = exp(φ_i) / ∑_{j=1}^{k} exp(φ_j)    (3.1)
Given this distribution P_nextnote, we sample the next note, concatenate it to the
piano-roll sequence, move the history window one time-step ahead, then repeat the
process for the next prediction. Due to computational restrictions and the training cost,
the model only learns based on the previous 40 time-steps, which is two and a half bars
back in time.
It should be noted that, alongside the notes in the range, the probability distribution
P_nextnote also includes the probability of the sustain bit. If sampled, the sustain bit
means that no new note is to be played and the last note is elongated by one sixteenth-
note duration. The functionality of sampling a rest (a beat of silence in which no note
is played and the sustain bit is also off, which cuts the previous note short and results
in a time step of silence) is not included in the model, as the training data contained
rests only on very rare occasions and it was decided that they were not indicative of
the style for practical purposes.
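The sampling loop described above can be sketched as follows. The `predict` function stands in for the trained network's dense-layer output and is an assumption of the sketch, as is the fixed class count; one of the k classes can be understood as the sustain bit.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(phi):
    """Equation 3.1: normalized exponential of the dense-layer output."""
    e = np.exp(phi - phi.max())  # subtract max for numerical stability
    return e / e.sum()

def generate(predict, seed, n_steps, window=40):
    """Autoregressive sampling: `predict` is a stand-in for the trained
    network, mapping a (window, k) history to the dense output phi."""
    seq = list(seed)
    for _ in range(n_steps):
        history = np.array(seq[-window:])
        p = softmax(predict(history))    # P_nextnote over the k classes
        i = rng.choice(len(p), p=p)      # sample rather than argmax
        step = np.zeros(len(p))
        step[i] = 1.0                    # one-hot next time-step
        seq.append(step)
    return np.array(seq)
```

Sampling from `p` instead of taking the argmax is the stochastic choice discussed in the experiments chapter.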
3.4.4 Conceptual Blending
Given the two generative models, the next step was to investigate different methods
of applying conceptual blending to them. This is not a trivial task, as the theoretical
concept of combinational creativity, which humans exercise seemingly trivially and on
a daily basis, is difficult to recreate computationally (CoInvent, 2017). To tackle this
task, two different approaches to combining the models were hypothesized, one of
which proved more successful than the other.
The first proposed method of combining the models was intuitive. A seed would
be given to one of the two models and its predictions sampled to generate two
measures of music (32 time steps); then the active model would be switched and
the other model would generate two measures, and so on. This method of alternating
between the models to extrapolate the melody seems naive at first glance but,
considering that the history window upon which each prediction is made is 40
time-steps (two and a half measures), its plausibility becomes clearer.
Since the prediction is based on a window longer than the sequence generated by
the previous model (32 time-steps), the output of the previous model, as well as the
output of the current model from its last turn, is taken into account when predicting
the sequence of notes, rather than simply extrapolating the previous model's output.
As a result, the predicted sequences are increasingly based on a mixture of both
models and, in a cumulative process, the blends improve as the sequence length
increases.
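A minimal sketch of this alternating scheme, assuming each model is wrapped as a function that maps a history window to one sampled one-hot time-step (the wrappers and dimensions are illustrative, not the project's actual interfaces):

```python
import numpy as np

def alternating_blend(models, seed, n_bars, steps_per_bar=16, window=40):
    """Hand control back and forth between the two samplers every two
    measures (32 time steps). Each element of `models` is a stand-in
    mapping a history window to one sampled one-hot time-step."""
    seq = list(seed)
    active = 0
    for _ in range(n_bars // 2):
        model = models[active]
        for _ in range(2 * steps_per_bar):
            seq.append(model(np.array(seq[-window:])))
        active = 1 - active  # switch the active model
    return np.array(seq)
```

Because the 40-step window exceeds the 32-step handover, each model always conditions on some of the other's output, which is the cumulative mixing effect described above.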
The second model is theoretically driven, from a probabilistic viewpoint, but did
not prove very successful in practice. It makes use of the probability distributions gen-
erated by the predictive models and attempts to combine them to generate a blended
probability distribution that captures the predictions of both models. Starting with a
seed, the two models are fed the history sequences in parallel; the generated proba-
bility distributions over the notes are then added or multiplied (both methods were
attempted) and normalized, after which a note is sampled from this joint distribution
and the process repeated.
As previously stated, the input and output lengths of the two models are not equal,
since the range of allowed notes of a model is determined by the highest and lowest
notes in its training corpus. The difference in range was not significant, as the folk
music model's range was only 3 notes larger than the Bach model's. This discrepancy
in range and, consequently, in input/output vector length was resolved, depending on
the situation, as shown below.
• When feeding the Bach output to the Folk model, the length of the vector was
increased to fit by padding the note indices not present in the Bach model with
zeros before feeding the sequence to the Folk model.
• In the inverse case, where the Bach model expects a shorter input length, the
notes outside of its range were clipped off and only the notes in the range it
learned were fed to it. This is a shortcoming of the model, since some informa-
tion is omitted, but the notes outside the range are assumed to be extreme
pitches (too high or too low) that would appear very infrequently, if at all, in the
Folk model's predictions in the first place. The sporadic, very sparse cases
where they are present are assumed not to greatly affect the Bach model's
prediction, as such a note is likely to occupy at most one of the 40 previous
time-steps the prediction is based on.
• When adding or multiplying the two predicted distributions, the prediction
of the shorter output (Bach) was padded with zeros, indicating a zero chance of
the notes outside its training range being played. This gives the unshared notes a
probability of zero when multiplying the distributions and, when adding, a low
probability of being played relative to the other notes, whose probabilities are a
sum of two numbers as opposed to a sum of one probability and zero.
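The padding and combination steps can be sketched as follows. The `offset` parameter, which locates the Bach range inside the Folk range, and the fallback when multiplication zeroes out every note are assumptions of this sketch, not details specified in the text.

```python
import numpy as np

def blend_distributions(p_bach, p_folk, offset, mode="multiply"):
    """Pad the shorter Bach distribution with zeros to cover the Folk
    note range, combine the two, and renormalize."""
    padded = np.zeros_like(p_folk)
    padded[offset:offset + len(p_bach)] = p_bach
    joint = padded * p_folk if mode == "multiply" else padded + p_folk
    total = joint.sum()
    if total == 0:       # multiplication can zero out every note
        joint, total = p_folk, p_folk.sum()
    return joint / total
```

Under multiplication the unshared notes get probability exactly zero; under addition they merely become relatively unlikely, as described above.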
Chapter 4
Experiments
In this chapter, the details of running the experiments and training the generative mod-
els will be discussed as well as some design decisions that were attempted while build-
ing the music generation system but were discarded as the design and implementation
evolved.
While the chapter divide at this point is seemingly arbitrary and the information
presented herein could be incorporated, based on its nature, into the methodology and
evaluation chapters, it seemed more fitting, and convenient for reference, to enclose
all information pertaining to the direct running of the experiments and the learning
process and model performance in a separate chapter, as these are distinct from the
model architecture and design aspects. Similarly, it was decided that the different
design approaches and model iterations, their shortcomings and the reasons they were
discarded fall under the category of ‘experimentation’ during the system design and,
consequently, were included in this chapter. These unsuccessful iterations are only
included for thoroughness and for the intermediate stages of the project and evolution
of the model to be reported and for their shortcomings to be noted and documented for
future research.
4.1 Loss Function
Since the output of the network is supposed to model a probability distribution, the
loss function used for training the model is the categorical cross-entropy error. Cross
entropy measures the discrepancy between a predicted probability distribution q and a
true, or target, probability distribution p. ‘Categorical’ cross entropy is the name given
to a special cross entropy loss function available in the Keras deep learning library
(Chollet et al., 2015), used when the target distribution is one-hot encoded, meaning
one bit, or ‘class,’ is switched on and the rest are set to zero, which is true of our
representation (refer to section 3.3). The cross entropy for discrete p and q is defined
as shown in equation 4.1.
H(p, q) = −∑_x p(x) log q(x)    (4.1)
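For a one-hot target p, equation 4.1 collapses to the negative log-probability of the true class, which a short sketch makes concrete (the small epsilon guarding against log 0 is a standard numerical precaution, not part of the definition):

```python
import numpy as np

def categorical_cross_entropy(p, q, eps=1e-12):
    """Equation 4.1: -sum_x p(x) log q(x); eps avoids log(0)."""
    return -np.sum(p * np.log(q + eps))

p = np.array([0.0, 1.0, 0.0, 0.0])  # one-hot target time-step
q = np.array([0.1, 0.7, 0.1, 0.1])  # predicted softmax distribution
loss = categorical_cross_entropy(p, q)  # equals -log(0.7), about 0.357
```

Only the term for the true class survives the sum, so minimizing this loss pushes the predicted probability of the observed next note toward 1.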
4.2 Training the Models
The model architecture described was implemented using Keras and Tensorflow and
the training ran for 400 epochs with a batch size of 128 and early stopping with a
patience of 5. This patience value means that training is stopped after five consecutive
epochs with no improvement in the loss, indicating that training is no longer
sufficiently effective.
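The training itself used Keras's built-in early-stopping callback; the patience rule it applies to the monitored loss can be sketched in isolation as follows (an illustration of the stopping logic, not the Keras implementation):

```python
def should_stop(loss_history, patience=5):
    """Stop once the `patience` most recent epochs all failed to
    improve on the best loss recorded before them."""
    if len(loss_history) <= patience:
        return False
    best_before = min(loss_history[:-patience])
    return all(loss >= best_before for loss in loss_history[-patience:])
```

With a patience of 5, five stagnant epochs in a row end the run, which is how the Folk and Bach models stopped at epochs 159 and 177 despite the nominal 400-epoch budget.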
After running the training scripts with the early stopping callback feature, the Folk
model stopped at epoch 159 with a final loss of 0.1878 while the Bach model stopped at
epoch 177 with a final training loss of 0.3690, as shown in the plots below, illustrating
training loss against epoch. This difference could be due to the small Bach training set
compared to the traditional tunes (see section 3.1) or the complexity and variation of
Bach’s music as opposed to the traditional folk tunes, which are mostly simple melodic
lines following basic harmonies, with few accidental notes and little variation in any
given piece. These factors likely make Bach's music harder to model and predict.
This loss serves well for training and comparing models but, while indicative of how
well the learned probability distributions match the original style, it is hardly sufficient
for evaluating the pleasantness of a musical output. For this reason, the loss values
are only mentioned here while describing the training and are not analyzed further in
the later chapter on evaluating the output.
(a) Folk model
(b) Bach model
Figure 4.1: Model training losses against epochs
4.3 Intermediate Stages and Experimentation
During the design process, and over time, before arriving at the final design decisions
and architecture described in the previous chapter, different approaches were imple-
mented and tested with varying levels of success and failure. These are listed and
described below.
• Different model parameters. Before arriving at the architecture described, dif-
ferent parameters of the neural network were tested, specifically relating to the
number of hidden units in the LSTM layers. The first LSTM layer initially had
64 recurrent units (matching the second layer), as opposed to 128 in the final
model, but this configuration did not achieve interesting or pleasant results that
properly emulated the training music. This is possibly because the fewer the
units, the weaker the representational power of the network; therefore, the
earlier architectures with smaller hidden layers could not capture the complex
nature of the musical styles and the relationships between the notes.
Another parameter varied in earlier stages was the dropout rate in the two
dropout layers. Dropout rates of 0.2 and 0.4 were tested but, upon examining
the output, the network seemed more prone to overfitting its weights to the
training music, resulting in many musical phrases exactly copying segments of
the training data. This is not the desired behavior of the system, as the goal is
to model the styles as closely as possible without copying the music exactly.
• Different data representation. The initial data representation used was much
simpler, and arguably significantly more primitive and naive: the music was
represented as a series of vectors of length 2, with the first value being the MIDI
integer value for the pitch and the second a float representing the duration of the
note (a quarter beat being 0.25, a whole beat 1.0, etc.). This resulted in very
messy output, not faithful to the correct key of the training pieces, containing
mostly accidental notes outside the correct key and harmony and sounding very
chaotic and unpleasant. This is because the final layer is a dense layer
outputting an approximated value as close as possible to the pitch value
believed to be true (as opposed to the softmax layer giving a probability distri-
bution over the note classes), as in a regression task, which is usually much less
accurate than the classification task implied by the one-hot encoding of the piano
roll notation, where the different possible notes in the next time-step are mod-
eled as different classes. As a result, an explicit rule-based system had to be
implemented through which the output is passed and the pitch values are
rounded to the closest pitch in the scale. This is undesirable and inaccurate, as
it forces the whole outputted piece to fit the seven notes of the scale, while the
original pieces did contain accidental notes from outside the main tonality of
the key, used properly and in a calculated fashion without sounding dissonant
or out of place.
Moreover, the predicted duration values were mostly impossible fractions lying
between the normal subdivisions of the music and were similarly forced to be
rounded to the nearest quarter of a beat, which also adds inaccuracy and unfaith-
fulness to the style learned and outputted by the neural network. This
data representation also does not allow for a probabilistic sampling of the next
time-step, since the output is deterministic and does not allow for uncertainty
or randomness in the sequence generated, the shortcomings of which will be
discussed in the next point.
• Different time-step determination method. Even after deciding on the piano-roll
and softmax implementation, the initial method for determining the next note
was deterministic: given the softmax probability distribution, the note with the
highest probability was automatically selected and added to the generated se-
quence, resulting in the maximum-likelihood piece being generated. This pro-
duced uninteresting results, with the same musical output generated at every run
of the models. The lack of stochasticity also resulted in some overfitting, since
the most probable notes were always picked and, consequently, some segments
of the music exactly copied segments from the training data. Observing this, it
was decided to make use of the predicted distribution by sampling from it,
adding some randomness to the decision and resulting in more interesting output
that escapes such copied passages after a short segment of notes, since the
stochasticity decreases the probability of reproducing a long training sequence
verbatim.
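The difference between the discarded deterministic rule and the final stochastic one is a one-line change, sketched below with an illustrative distribution:

```python
import numpy as np

rng = np.random.default_rng(1)
p = np.array([0.05, 0.60, 0.20, 0.10, 0.05])  # illustrative softmax output

# Early design: always pick the modal note, so every run reproduces
# the same maximum-likelihood piece (and can copy the training data).
greedy_note = int(np.argmax(p))

# Final design: sample from the distribution, so runs diverge and long
# verbatim copies of the training set become increasingly unlikely.
sampled_note = int(rng.choice(len(p), p=p))
```

Over a sequence of n steps, the chance of the sampler reproducing the greedy path shrinks roughly as the product of the modal probabilities, which is why copied passages stay short.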
Chapter 5
Evaluation
Having trained the models and achieved acceptable results (given the time constraint)
and, subsequently, tested the hypothesized conceptual blending models, the next step
was to evaluate the outcomes of the models in a way that conveys with some degree
of certainty how well the models perform, how acceptable the generated music is and
how well it captures the given musical style. Given the nature of the topic and the
data at hand, this is far from an easy feat, as previously stated in the project proposal
and as reiterated and elaborated on below.
5.1 On Evaluating Music Composition Models
A glaring, immediately observable complication on surveying the field of algorith-
mic composition, or computational creativity in general, is the lack of agreed-upon,
objective evaluation methods by which to compare the generated outputs and, con-
sequently, the models used to generate them. As is clear from surveys of the field,
many papers choose to use human experts to evaluate the generated music, which is
not unreasonable, considering music is made primarily to be enjoyed by humans and
not to be scrutinized by a rigid mathematical or probabilistic formula. Other papers,
such as Cherla et al. (2015), use cross entropy to compare the pieces probabilistically,
which is not necessarily indicative of aesthetic pleasantness but allows for objective
comparison of different models, as is done for the models proposed in this work.
Some experiments use a combination of methods for evaluation, like the two-step
evaluation described in Adiloglu and Alpaslan's (2007) experiment, where, in the first
step, they use the rules of counterpoint to assess the pieces based on how many rules
were violated. Others use methods from natural language processing, like the overall
probability of a given generated piece (similar to the overall probability of a sentence).
However, it remains clear that this is an open area of research, as well as of debate as
to whether it is fitting to come up with one objective measure of evaluation or to leave
it to humans to evaluate the aesthetics of generated music.
Mozer (1994) discusses this point as it pertains to neural networks in particular,
given that ANN systems are mostly discussed informally and with no strict critical
measures, possibly because their outputs are presumed to be successful by definition.
Nierhaus (2009) comments on this, restating the previously mentioned claim that
“this criticism points out a common lack in a number of publications in the field of
algorithmic composition.” This is a significant gap in the field that needs further
discussion and examination.
5.2 On Evaluating Conceptual Blending Models
This part of the project builds primarily on the fields of computational creativity
and music, but as it pertains to blending concepts and the resulting output of a model
that successfully does so, it significantly complicates the already difficult problem
of evaluation in algorithmic composition. Moreover, since the outputs of two models
are blended by a third model, it would be unrealistic to evaluate the combined output
using the metrics of either constituent model alone. Given this complication, and the
problem's entanglement with human aesthetics and subjective disposition, it seems
reasonable to have humans judge the outcome of the blending model: whether it
follows the rules of both styles being blended and, consequently, sounds like the two
styles simultaneously.
Zacharakis et al. (2015; 2017) address this problem of evaluating the output of a
conceptual blending musical system and use a similar approach for empirical evalua-
tion of the musical cadence blendoids. They used musically trained participants in two
different tests. The first was a “nonverbal dissimilarity rating listening test” and the
other was a verbal descriptive test that was more subjective and more useful “to assess
the qualities of the produced blends.”
Further to the problem of evaluating conceptually blended models, it is not hard
to imagine that, regardless of the application domain in question, it is a formidable
mental exercise to devise a way of evaluating what constitutes a ‘good’ blend. To
demonstrate this, the canonical example mentioned earlier (see section 2.5.1) of the
‘boat’ and ‘house’ conceptual spaces and their many possible blends, as discussed
in Goguen (2006), can be examined.
Given these two spaces and some (hypothetical) blends of them, as perceived by
different individuals, what, if any, would constitute an objectively better blend than
others? Would a house capable of floating on water be more acceptable than a boat that
also functions as a living space? Moreover, to pose an even more naive, yet still
valid, reflection on the topic: would a yacht, fully equipped with all the luxuries
needed to be perfectly livable, and indeed functioning as the living space for some
hypothetical hermit, count as an acceptable blend, or merely as a variant of a boat
with no ‘house’ aspect to it, and therefore not a ‘blendoid?’
Upon examining the previous questions and the clearly subjective nature of the an-
swers, it could be reasonably concluded that attempting to find an objective evaluation
of a blend is unrealistic, and, at least in the creative application domains, a subjective
participant evaluation, like the one described above in Zacharakis et al. (2015; 2017),
is more suitable. The empirical participant evaluation will be the main evaluation used
to measure the model’s success.
The results of the evaluations of the models are presented below.
5.3 Analyzing Musical Outputs Against Training Data
Despite having concluded that, for a creative output, a subjective evaluation is best,
it was also appropriate to attempt to gain some quantitative insight on how well the
models captured the style and patterns present in the training data. The two models
achieved different levels of success, for reasons that will be speculated upon and
explained where possible. The results of this analysis are presented below.
Observing the histograms of the pitch class frequencies shown below, a few defin-
ing characteristics can be readily observed. In both Bach and traditional music, and as
expected from basic music theory, the most frequent pitch is G, which is the central
tonality of the G major key. The notes that follow in frequency after that are, also as
expected, D and B, which, along with G make up the G-major tonic triad. After that,
the next most frequent notes are A and F#, which are part of the dominant triad (D,
F# and A), followed by C and E, which are part of the sub-dominant triad (C, E and
G). One exception is that Bach tends to use pitch class A very frequently, even more
than D and B, the two notes that, with G, make up the tonic triad, which is unusual
(but better left to musicologists to analyze). Moreover, a few outside notes are used
infrequently in the pieces, namely C#, Eb, F, G# and Bb; these are pitches not in the
main scale, used occasionally to make the music more interesting and unexpected.
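Pitch-class frequency counts like those in the histograms described here are straightforward to derive from MIDI note numbers, since a pitch class is simply the note number modulo 12. A small self-contained sketch, separate from the project's music21-based analysis pipeline:

```python
from collections import Counter

# Pitch-class spelling chosen to match the note names used in the discussion.
PITCH_NAMES = ['C', 'C#', 'D', 'Eb', 'E', 'F', 'F#', 'G', 'G#', 'A', 'Bb', 'B']

def pitch_class_histogram(midi_notes):
    """Count how often each of the 12 pitch classes occurs in a note sequence."""
    counts = Counter(PITCH_NAMES[n % 12] for n in midi_notes)
    return {name: counts.get(name, 0) for name in PITCH_NAMES}

# Hypothetical G-major fragment (67 = G4, 71 = B4, 74 = D5, 69 = A4, 66 = F#4).
hist = pitch_class_histogram([67, 71, 74, 67, 69, 66, 67])  # G is most frequent
```

Comparing such a histogram for a model's output against one for its training corpus gives exactly the kind of key-adherence check performed above.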
Observing the pitch class frequencies in the models’ generated outputs, it is clear
that the folk model learned the correct key of the pieces and managed to generate
music with almost exactly the same note frequencies as the original style, without
any explicit rules to use notes from that key. The Bach model, on the other hand,
used many of the notes successfully, but with some anomalies and frequencies not
present in the original Bach suites: its most used notes are D and C, as opposed to
the tonic G, and it used the outside notes Eb and F more frequently than their in-key
counterparts E and F#, respectively, in violation of tonal music theory and the style
of the Bach suites. This is most likely because the training dataset is too small (see
section 3.1), a well-known problem in machine learning, coupled with the fact that
Bach’s music is very complex and its dependencies and patterns cannot be easily
learned.
Another pattern that becomes clear upon observing the weighted scatter diagrams
(which were produced after excluding triplets and complex durations that would be
removed in preprocessing) of the music is Bach’s persistent use of 16th notes (quarter
beats) and, less frequently, eighth notes. The folk pieces, on the other hand, use mainly
eighth and quarter notes, which is not unexpected given that they are usually simple
melodies that can be sung and played by relatively amateur musicians. The two
models did not manage to capture the note durations very accurately, as can be seen
in their scatter diagrams: the music of both models tends to settle on one note
duration and favor it almost exclusively.
The note duration problem, intuitively, could be due to the data representation used
(see section 3.3), which results in the sustain bit being on too frequently during training
and is therefore given a lot of weight and very high probability during prediction.
However, while this explains the long sustained notes, it does not explain the emphasis
on a certain time duration for each model (half beat and two beats in the Bach and folk
models, respectively). It should also be noted that, at every run of the program (each
outputting different predictions), the two models would consistently produce a
segment of interesting music with varying durations more in line with those observed
in the training pieces before getting stuck generating these relatively long notes.
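The sustain-heavy training signal described above can be illustrated with a toy encoder: each note contributes one note-on step and then sustain steps for the rest of its duration, so even moderate note lengths make the sustain event the dominant prediction target. This encoder is a simplified, hypothetical stand-in for the representation of section 3.3:

```python
def encode_with_sustain(notes, steps_per_beat=4):
    """Encode (pitch, duration_in_beats) pairs as per-timestep events:
    a note-on event at the first step, then 'sustain' events until the note ends."""
    events = []
    for pitch, beats in notes:
        steps = int(beats * steps_per_beat)
        events.append(('on', pitch))
        events.extend([('sustain', pitch)] * (steps - 1))
    return events

# A hypothetical bar of four quarter notes at 16th-note resolution.
seq = encode_with_sustain([(67, 1.0), (69, 1.0), (71, 1.0), (74, 1.0)])
sustain_fraction = sum(1 for kind, _ in seq if kind == 'sustain') / len(seq)  # 0.75
```

Even in this simple bar, three quarters of the training targets are sustain events, which is consistent with the models' tendency to hold notes far too long.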
Worth mentioning is the fact that this duration difference from the original music
is not particularly relevant or detrimental to the actual use of notes to model the
style, since this is an issue of tempo that can be manipulated after the output is
acquired (e.g. two half notes at a certain tempo sound the same as two quarter notes
at half the tempo, just notated differently). After being sped up, the pieces sounded
pleasant and relatively similar to the styles modeled. This, however, is an issue still
worth examining, as note durations and tempo are important characteristics of music
styles and, consequently, of music style modeling, and they would not have been
disregarded were it not for the time constraint and the fact that this project is
primarily concerned with melodic style as opposed to beat and rhythm.
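The tempo equivalence argued here follows directly from the fact that a note's real-time length is its beat count times the seconds per beat:

```python
def note_seconds(beats, bpm):
    """Real-time length in seconds of a note lasting `beats` beats at tempo `bpm`."""
    return beats * 60.0 / bpm

# A half note at 120 BPM occupies the same time as a quarter note at 60 BPM.
half_at_120 = note_seconds(2, 120)
quarter_at_60 = note_seconds(1, 60)
```

So the overly long durations the models emit can be compensated for by playback tempo without altering the pitch content that the style analysis focuses on.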
The next section will present the results of the subjective empirical evaluation by
the participants regarding the two music style models as well as one of the two
conceptual blending models proposed, namely the one in which the two models
alternate in predicting the measures. The output of the other blending model, in
which the two generative models predict the next time-step in parallel and the two
probability distributions are combined (either by addition or multiplication),
normalized, then sampled from, was deemed unsuitable for presentation or further
examination, as it mostly comprised very long notes sustained over many bars and
very rarely changing. It is not difficult to conclude that this is due to the
aforementioned already high probability of sustaining a note: when the two
distributions are added together, the sustain event receives an even bigger weight
and probability once the new blended distribution is normalized. The repeating note
pitch can be explained by the same reason, as the notes more central to the key have
high probabilities in both distributions, resulting in a very high probability in the
blended probability distribution.
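Both blending schemes discussed in this chapter can be sketched in a few lines; the functions below are illustrative stand-ins for the actual implementation, using toy distributions and dummy measure generators rather than LSTM outputs:

```python
import random

def blend_and_sample(p_a, p_b, rng=random.Random(0)):
    """Parallel blend: add the two models' next-step distributions,
    renormalize, then sample a single event from the result."""
    combined = [a + b for a, b in zip(p_a, p_b)]
    total = sum(combined)
    combined = [c / total for c in combined]
    event = rng.choices(range(len(combined)), weights=combined)[0]
    return event, combined

def alternate_measures(model_a, model_b, n_measures):
    """Alternating blend: even-numbered measures come from one model,
    odd-numbered measures from the other."""
    return [(model_a if i % 2 == 0 else model_b)(i) for i in range(n_measures)]

# Toy distributions over four events; both put most mass on event 1
# (analogous to the sustain bit), so the additive blend reinforces it further.
event, blended = blend_and_sample([0.1, 0.6, 0.2, 0.1], [0.2, 0.5, 0.2, 0.1])
measures = alternate_measures(lambda i: f'bach_{i}', lambda i: f'folk_{i}', 4)
```

The reinforcement of the already dominant sustain event in the additive blend is exactly the failure mode described above; the alternating scheme sidesteps it by never mixing the two distributions directly.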
(a) Original Bach pieces (b) Original Folk pieces
(c) Bach model output (d) Folk model output
Figure 5.1: Pitch class frequency histograms
(a) Original Bach pieces (b) Original Folk pieces
(c) Bach model output (d) Folk model output
Figure 5.2: Pitches and note lengths weighted scatter diagrams
5.4 Empirical Evaluation
For the subjective evaluation by a human audience, 11 participants (fewer than
originally planned, as time did not permit recruiting more) were given questionnaires,
had the music played to them, and answered questions about it. The questions
included rating the similarity of the music, or how well they thought the AI model
captured the style, on a conventional five-point scale, where 1 through 5 indicate
very poor, poor, average, good, and excellent, respectively. As for the conceptual
blending model, a different five-point scale was given, shown below, and the
participants were asked to pick the number corresponding to the style they felt the
blended music represented and how fairly it represented the two given genres.
1. Exclusively folk
2. Predominantly folk, only incidentally similar to Bach
3. Equally folk and Bach style
4. Predominantly Bach, only incidentally similar to folk reels
5. Exclusively Bach
The results were generally favorable and in line with what has been observed up
to this point. The participants’ average ratings for the Bach and folk models were
3.27 and 3.55, respectively, indicating slightly above-average success of the models.
The results, shown below, also reveal the lack of a unique mode for the folk model
and, despite a mode of 3 for the Bach model, a notable spread of opinions across the
spectrum. This points further to the subjectivity of music and the lack of uniformity
in perceiving it, making it difficult to assess such a system of algorithmic
composition with anything other than a large sample of participants.
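The summary statistics reported here (mean and mode of five-point ratings) are standard; the ratings list below is illustrative, chosen only to reproduce the reported 3.27 average, and is not the actual survey data:

```python
from statistics import mean, multimode

# Illustrative ratings for 11 participants on the five-point scale (not real data).
ratings = [3, 4, 3, 2, 5, 3, 4, 2, 4, 3, 3]
avg = mean(ratings)         # ~3.27
modes = multimode(ratings)  # [3]
```

With a sample this small, a single changed rating moves the mean by almost 0.1, which underlines the need for a larger participant pool.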
Figure 5.3: Participant rating of the style models’ accuracy showing notable spread of
opinions
The conceptual blending model similarly achieved very favorable results, displayed
below as a continuous bell curve to better visualize the interesting polarity of the par-
ticipant opinions. The average rating on the presented music style scale is 3.18, in-
dicating what could be perceived as a very successful, almost equal blend of the two
music styles. The slight skew to the right of the curve is possibly due to the relatively
fast tempo the blended output was played in, which made it sound similar to Bach’s
music, as one participant commented in an optional “open comment” question at the
end of the questionnaire that “the music is too fast to be folk.”
These results demonstrate the success of the conceptual blending model, which is
likely to be equally successful in other, similar application domains in which a
blend of two probabilistic history-based sequence prediction models is desired.
However, the factors that could introduce inaccuracies during evaluation should also
be noted and taken into account when evaluating other models. In this project, these
factors included the tempo differences in the music (which are in no way learned or
decided by the model), which could affect the opinion of participants with little
musical background. The lack of musical background is itself another factor, as one
participant with little musical background anecdotally and offhandedly remarked
that he was merely guessing when asked which piece he believed was the original
and which was generated by the machine learning model.
Figure 5.4: Participant rating of the blending model showing polarization of opinion
Chapter 6
Conclusion
To fully comprehend the aim and scope of the project and put it in context, it is im-
portant to understand the motivation and train of thought that led to it, including the
series of events through which the project went before crystallizing in the form pre-
sented herein. This section will attempt to clarify this and, hopefully, shed light on
some of the research that has been done as well as suggest future work and directions
that would have been interesting to pursue if time permitted.
6.1 Context and Motivation
The project was motivated by the field of conceptual blending and the recent research in
computationally modeling the cognitive process of blending two concepts to achieve a
novel output. More specifically, the aim was to build on Project Coinvent (2017) which
aimed to “develop a computationally feasible, cognitively-inspired formal model of
concept creation, drawing on Fauconnier and Turner’s theory of conceptual blending”
and included research on the conceptual blending of musical models, cited earlier
in this report (see chapter 2).
While the main aim was to expand the body of research on computational concep-
tual blending of music models, a fundamental part of that was the original constituent
models to be blended. The initial plan was to acquire two models presented by other
researchers from the vast body of work on probabilistic music style modeling.
However, for reasons including outdated code repositories, non-responsive authors
and incomplete, shallow access to models (e.g. access only to the final output),
which does not allow custom manipulation of the predictions, this plan fell through
and no suitable algorithmic composition models were acquired.
Stemming from this, it was decided that the project be augmented with a first stage,
in which two probabilistic music style models were designed and implemented, to
be used in the second stage, in which the computational model for conceptual
blending was presented and tested. This sudden change in the scope of the project,
and the consequent tightening of the time restrictions, was a significant challenge
and stood in the way of many possibilities worth pursuing, which are discussed later.
6.2 Summary and Future Work
A deep learning model architecture was presented and implemented that was success-
ful in capturing the two styles chosen for the task. Participant surveys showed a general
satisfaction with the extent to which the styles were modeled and a tendency towards
a slightly above average rating of the models. There were a few shortcomings of the
models that were discussed and speculated upon and could be focused on further in the
future.
Having successfully acquired the two models, two conceptual blending models were
proposed, one of which did not achieve results acceptable enough to proceed to the
participant evaluation and was therefore discarded. The other proposed model
achieved very good results, and participant evaluation showed that, on average, the
model successfully blends the two constituent musical styles into a new output that
captures characteristics of both. It is presumed that this proposed computational
model for conceptual blending would prove equally successful in blending other
sequence prediction models, as it is agnostic to the application domain and does not
use specialized musical knowledge or rules in any way.
From this concluding note, below are some possible directions for future research
that could have been pursued if it weren’t for the time constraint.
• An intuitive extrapolation of this research would be to test the proposed conceptual
blending approach on other models and application domains and examine the output
to see whether it achieves equally good results. This could be done either with
similar models that extrapolate a long history sequence by predicting the next step,
or with simpler Markov models that take only a very short history into consideration,
applying the concept of alternating different predictive models to them.
• The music style models can be augmented in many different ways to achieve
better results. The literature on this is extensive and covers many aspects of music,
but specifically improving the model proposed here would involve a better music
representation, one more suitable for capturing durations and allowing the rhythm
to be learned, as opposed to the one used here, which is problematic for the reasons
explained in chapter 5.
• As previously stated, the literature on evaluating both algorithmic composition
and conceptual blending models is fraught with obstacles and issues that make it
very difficult to reach a method of objective evaluation through which different
models can be compared. More research certainly needs to be done to fill this
gap and, hopefully, get a clearer, more quantitative metric to compare against.
• More obviously, more research on computational conceptual blending needs
to be undertaken, whether that is research in different domains where concep-
tual blends could be of use, or on different computational models of conceptual
blending that can successfully generate a novel, creative output given two differ-
ent conceptual spaces.
Appendix A
Participant survey
1. Do you have any musical background? Please rank your knowledge of music
on a scale of 1 to 5, with 1 indicating no musical background and 5 indicating
full professional proficiency of a musical instrument(or vocal performance) and
music theory.
O 1 O 2 O 3 O 4 O 5
2. You will now listen to two pieces of music, one of which is a traditional reel (dance)
and another composed by an artificial intelligence agent trained on traditional British
and American folk reels. After listening to both pieces, please indicate which one
you believe is a real traditional tune and which is composed by the AI (please use
“1” to indicate the first of the two pieces you just listened to, and “2” to indicate
the second).
traditional .........
AI .........
3. To what extent would you say the AI succeeded in modeling the given musical
style (you may think of this rephrased as ‘how similar is the style of the two
pieces’)? Please give a ranking on a scale of 1 to 5, with 1 indicating poor per-
formance or “no similarity in style” and 5 indicating excellent performance or
“identical style” between the two compositions.
O 1 O 2 O 3 O 4 O 5
41
4. You will now listen to two pieces of music, one of which is composed by Bach
and another by an artificial intelligence agent trained on pieces written by Bach
and posing as him. After listening to both pieces, please indicate which one you
believe is composed by Bach and which one you believe is by the AI (please use
“1” to indicate the first of the two pieces you just listened to, and “2” to indicate
the second).
Bach .........
AI .........
5. To what extent would you say the AI succeeded in modeling the given musical
style (you may think of this rephrased as ’how similar is the style of the two
pieces’)? Please give a ranking on a scale of 1 to 5, with 1 indicating poor per-
formance or “no similarity in style” and 5 indicating excellent performance or
“identical style” between the two compositions.
O 1 O 2 O 3 O 4 O 5
6. You will now listen to a piece that attempts to combine aspects of the two pre-
vious music styles. Please characterise the style of the final piece using the
following scale:
O Exclusively folk
O Predominantly folk, only incidentally similar to Bach
O Equally folk and Bach style
O Predominantly Bach, only incidentally similar to folk reels
O Exclusively Bach
If you have any further comments on the experiment, algorithmic composition,
computer creativity, the output of the AI composers in relation to the original
pieces they learned from or insights on how it decided to blend the two styles
into one piece, please elaborate below.
...............................................................................................................
...............................................................................................................
...............................................................................................................
...............................................................................................................
...............................................................................................................
...............................................................................................................
...............................................................................................................
...............................................................................................................
...............................................................................................................
...............................................................................................................
...............................................................................................................
...............................................................................................................
...............................................................................................................
...............................................................................................................
...............................................................................................................
...............................................................................................................
Bibliography
Adiloglu, K. and Alpaslan, F. N. (2007). A machine learning approach to two-voice
counterpoint composition. Knowledge-Based Systems, 20(3):300–309.
Ames, C. (1987). Automated composition in retrospect: 1956-1986. Leonardo, pages
169–185.
Bengio, Y., Simard, P., and Frasconi, P. (1994). Learning long-term dependencies with
gradient descent is difficult. IEEE transactions on neural networks, 5(2):157–166.
Bishop, C., Bishop, C. M., et al. (1995). Neural networks for pattern recognition.
Oxford university press.
Boden, M. A. (2009). Computer models of creativity. AI Magazine, 30(3):23.
Boenn, G., Brain, M., De Vos, M., et al. (2008). Automatic composition of melodic
and harmonic music by answer set programming. In International Conference on
Logic Programming, pages 160–174. Springer.
Boulanger-Lewandowski, N., Bengio, Y., and Vincent, P. (2012). Modeling tempo-
ral dependencies in high-dimensional sequences: Application to polyphonic music
generation and transcription. arXiv preprint arXiv:1206.6392.
Bowles, E. (1970). Musicke’s handmaiden: Or technology in the service of the arts.
The computer and music, pages 3–20.
Cambouropoulos, E., Kaliakatsos-Papakostas, M., and Tsougras, C. (2015). Structural
blending of harmonic spaces: a computational approach. In Proceedings of the 9th
Triennial Conference of the European Society for the Cognitive Science of Music
(ESCOM).
Cherla, S., Tran, S. N., Weyde, T., and Garcez, A. S. d. (2015). Hybrid long-and
short-term models of folk melodies. In ISMIR, pages 584–590.
Chollet, F. et al. (2015). Keras. https://keras.io.
CoInvent, P. (2017). Concept invention theory project. In Concept Invention Theory
Project Abstract. European Commission.
Conklin, D. (2003). Music generation from statistical models. In Proceedings of
the AISB 2003 Symposium on Artificial Intelligence and Creativity in the Arts and
Sciences, pages 30–35. Citeseer.
Cope, D. and Mayer, M. J. (1996). Experiments in musical intelligence, volume 12.
AR editions Madison, WI.
Cuthbert, M. S. and Ariza, C. (2010). music21: A toolkit for computer-aided musicol-
ogy and symbolic music data.
Eck, D. and Schmidhuber, J. (2002). A first look at music composition using lstm
recurrent neural networks. Istituto Dalle Molle Di Studi Sull Intelligenza Artificiale,
103.
Eppe, M., Confalonieri, R., Maclean, E., Kaliakatsos, M., Cambouropoulos, E., Schor-
lemmer, M., Codescu, M., and Kuhnberger, K. (2015). Computational invention of
cadences and chord progressions by conceptual chord-blending. AAAI Press; Inter-
national Joint Conferences on Artificial Intelligence.
Fauconnier, G. and Turner, M. (2008). The way we think: Conceptual blending and
the mind’s hidden complexities. Basic Books.
Goguen, J. A. (2006). Mathematical models of cognitive space and time. In Andler,
D. et al., editors, Reasoning and Cognition: Proceedings of the Interdisciplinary
Conference on Reasoning and Cognition, pages 125–148, Tokyo. Keio University
Press.
Hadjileontiadis, L. J. (2014). Conceptual blending in biomusic composition space:
The “brainswarm” paradigm. In ICMC.
Haigh, C. (2014). Exploring folk fiddle - an introduction to folk styles, technique and
impro. Schott & Co.
Herremans, D., Sorensen, K., and Conklin, D. (2014). Sampling the extrema from
statistical models of music with variable neighbourhood search.
Hild, H., Feulner, J., and Menzel, W. (1992). Harmonet: A neural net for harmoniz-
ing chorales in the style of js bach. In Advances in neural information processing
systems, pages 267–274.
Hiller Jr, L. A. and Isaacson, L. M. (1957). Musical composition with a high speed
digital computer. In Audio Engineering Society Convention 9. Audio Engineering
Society.
Johnson, D. D. (2017a). Generating polyphonic music using tied parallel networks. In
International Conference on Evolutionary and Biologically Inspired Music and Art,
pages 128–143. Springer.
Johnson, D. D. (2017b). Generating polyphonic music using tied parallel networks.
In Correia, J., Ciesielski, V., and Liapis, A., editors, Computational Intelligence
in Music, Sound, Art and Design, pages 128–143, Cham. Springer International
Publishing.
Kaliakatsos-Papakostas, M., Cambouropoulos, E., Kuhnberger, K.-U., Kutz, O., and
Smaill, A. (2014). Concept invention and music: creating novel harmonies via
conceptual blending. In In Proceedings of the 9th Conference on Interdisciplinary
Musicology (CIM2014), CIM2014. Citeseer.
Mao, H. H., Shin, T., and Cottrell, G. (2018). Deepj: Style-specific music generation.
In Semantic Computing (ICSC), 2018 IEEE 12th International Conference on, pages
377–382. IEEE.
Martin, A., Jin, C., van Schaik, A., and Martens, W. L. (2010). Partially observable
markov decision processes for interactive music systems. In Proceedings of the
International Computer Music Conference.
Miles Huber, D. (1991). The midi manual. USA. Howard W. Sams.
Moon, T., Choi, H., Lee, H., and Song, I. (2015). Rnndrop: A novel dropout for rnns
in asr. In Automatic Speech Recognition and Understanding (ASRU), 2015 IEEE
Workshop on, pages 65–70. IEEE.
Mozer, M. C. (1994). Neural network music composition by prediction: Exploring
the benefits of psychoacoustic constraints and multi-scale processing. Connection
Science, 6(2-3):247–280.
Mumford, M. D. (2003). Where have we been, where are we going? taking stock in
creativity research. Creativity research journal, 15(2-3):107–120.
Nierhaus, G. (2009). Algorithmic composition: paradigms of automated music gener-
ation. Springer Science & Business Media.
Olson, H. F. (1967). Music, physics and engineering, volume 1769. Courier Corpora-
tion.
Pachet, F. and Roy, P. (2011). Markov constraints: steerable generation of markov
sequences. Constraints, 16(2):148–172.
Ponsford, D., Wiggins, G., and Mellish, C. (1999). Statistical learning of harmonic
movement. Journal of New Music Research, 28(2):150–177.
Quick, D. (2016). Learning production probabilities for musical grammars. Journal of
New Music Research, 45(4):295–313.
Raczynski, S. A. and Vincent, E. (2014). Genre-based music language modeling with
latent hierarchical pitman-yor process allocation. IEEE/ACM Transactions on Au-
dio, Speech, and Language Processing, 22(3):672–681.
Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural Compu-
tation, 9(8):1735–1780.
Swift, A. (1997). A brief introduction to MIDI. http://www.doc.ic.ac.uk/~nd/
surprise_97/journal/vol1/aps2/.
Todd, P. M. and Werner, G. M. (1999). Frankensteinian methods for evolutionary
music. Musical networks: parallel distributed perception and performace, pages
313–340.
Turner, M. (2014). The origin of ideas: Blending, creativity, and the human spark.
Oxford University Press.
Zacharakis, A., Kaliakatsos-Papakostas, M., Tsougras, C., and Cambouropoulos, E.
(2017). Creating musical cadences via conceptual blending: Empirical evaluation
and enhancement of a formal model. Music Perception: An Interdisciplinary Jour-
nal, 35(2):211–234.
Zacharakis, A. I., Kaliakatsos-Papakostas, M. A., and Cambouropoulos, E. (2015).
Conceptual blending in music cadences: A formal model and subjective evaluation.
In ISMIR, pages 141–147.