
MUSIC COMPOSITION WITH DEEP LEARNING: A REVIEW

Carlos Hernandez-Olivan ∗

[email protected]

Jose R. Beltran

[email protected]

Department of Engineering and Communications, Calle María de Luna, Universidad de Zaragoza

ABSTRACT

Generating a complex work of art such as a musical composition requires exhibiting true creativity, which depends on a variety of factors related to the hierarchy of musical language. Music generation has been approached with algorithmic methods and, more recently, with Deep Learning models that are also used in other fields such as Computer Vision. In this paper we put into context the existing relationships between AI-based music composition models and human musical composition and creativity processes. We give an overview of recent Deep Learning models for music composition and compare these models to the music composition process from a theoretical point of view. We try to answer some of the most relevant open questions for this task by analyzing, among others, the ability of current Deep Learning models to generate music with creativity and the similarity between AI and human composition processes.

Keywords Music generation · Deep Learning · Machine Learning · Neural Networks

1 Introduction

Music is generally defined as a succession of pitches or rhythms, or both, in some definite patterns [1]. Music composition (or generation) is the process of creating or writing a new piece of music. The term music composition can also refer to an original piece or work of music [1]. Music composition requires creativity, which is the unique human capacity to understand and produce an indefinitely large number of sentences in a language, most of which have never been encountered or spoken before [2]. This is a very important aspect that needs to be taken into account when designing or proposing an AI-based music composition algorithm.

More specifically, music composition is an important topic in the Music Information Retrieval (MIR) field. It comprises subtasks such as melody generation, multi-track or multi-instrument generation, style transfer or harmonization. These aspects will be covered in this paper from the point of view of the multitude of techniques based on AI and DL that have flourished in recent years.

1.1 From Algorithmic Composition to Deep Learning

From the 1980s onwards, the interest in computer-based music composition has never stopped growing. Some experiments came up in the early 1980s, such as the Experiments in Musical Intelligence (EMI) [3] by David Cope from 1983 to 1989, or Analogique A and B by Iannis Xenakis [4]. Later, in the 2000s, David Cope also proposed the combination of Markov chains with grammars for automatic music composition, and other relevant works such as Project1 (PR1) by Koenig [5] were born. These techniques can be grouped in the field of Algorithmic Music Composition, which is a way of composing by means of formalizable methods [6] [7]. This type of composing consists of a controlled procedure based on mathematical instructions that must be followed in a fixed order. There are several methods within Algorithmic Composition, such as Markov Models, Generative Grammars, Cellular Automata, Genetic Algorithms, Transition Networks or Chaos Theory [8]. Sometimes these techniques and other probabilistic methods are combined with Deep Neural Networks in order to condition them or help them to better model music, as is the case of DeepBach [9]. These models can generate and harmonize melodies in different styles.

∗https://carlosholivan.github.io



However, the lack of generalization capacity of these models and the rule-based definitions that must be made by hand make these methods less powerful and generalizable in comparison with Deep Learning-based models.

From the 1980s to the early 2000s, the first works that tried to model music with NNs were born [10] [11] [12] [13]. In recent years, with the growth of Deep Learning (DL), many studies have tried to model music with deep Neural Networks (NNs). DL models for music generation normally use NN architectures that have proven to perform well in other fields such as Computer Vision or Natural Language Processing (NLP). Models pre-trained in those fields can also be used for music generation; this is called Transfer Learning [14]. Some NN techniques and architectures will be shown later in this paper. Music composition today is taking input representations and NN architectures from large-scale NLP applications, such as Transformer-based models, which are demonstrating very good performance in this task. This is due to the fact that music can be understood as a language in which every style or music genre has its own rules.

1.2 Neural Network Architectures for Music Composition with Deep Learning

First of all, we provide an overview of the most widely used NN architectures that are providing the best results in the task of musical composition so far. The most used NN architectures in the music composition task are Generative Models such as Variational AutoEncoders (VAEs) and Generative Adversarial Networks (GANs), and NLP-based models such as Long Short-Term Memory (LSTM) networks and Transformers. The following is an overview of these models.

1.2.1 Variational Auto-Encoders (VAEs)

The original VAE model [15] uses an Encoder-Decoder architecture to produce a latent space by reconstructing the input (see fig. 1a). A latent space is a multidimensional space of compressed data in which the most similar elements are located closest to each other. In a VAE, the encoder approximates the posterior and the decoder parameterizes the likelihood. The posterior and likelihood approximations are parametrized by NNs with λ and θ parameters for the encoder and decoder, respectively. Posterior inference is done by minimizing the Kullback-Leibler (KL) divergence between the encoder (or approximate posterior) and the true posterior, which is achieved by maximizing the Evidence Lower Bound (ELBO). The gradient is computed with the so-called reparametrization trick. There are variations of the original VAE model such as the β-VAE [16], which weights the KL term of the loss with a penalty coefficient β in order to improve the latent space distribution. In fig. 1a we show the general VAE architecture. An example of a DL model for music composition based on a VAE is MusicVAE [17], which we describe in further sections of this paper.
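For concreteness, a standard way to write this objective, using the λ (encoder/approximate posterior) and θ (decoder/likelihood) parameters mentioned above, is the ELBO that training maximizes:

```latex
\mathrm{ELBO}(\theta, \lambda; x) =
  \mathbb{E}_{q_{\lambda}(z \mid x)}\big[\log p_{\theta}(x \mid z)\big]
  - D_{\mathrm{KL}}\big(q_{\lambda}(z \mid x) \,\|\, p(z)\big)
```

The first term rewards reconstruction of the input and the second term keeps the approximate posterior close to the prior; the β-VAE [16] simply weights this KL term with the coefficient β.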

1.2.2 Generative Adversarial Networks (GANs)

GANs [18] are Generative Models composed of two NNs: the Generator G and the Discriminator D. The generator learns a distribution pg over the input data. Training is done so that the discriminator maximizes the probability of assigning the correct label both to the training samples and to the samples produced by the generator. This training idea can be understood as D and G playing the two-player minimax game that Goodfellow et al. [18] described. In fig. 1b we show the general GAN architecture.
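For reference, the two-player minimax game described by Goodfellow et al. [18] can be written as:

```latex
\min_{G} \max_{D} V(D, G) =
  \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_{z}(z)}\big[\log\big(1 - D(G(z))\big)\big]
```

The discriminator D is trained to assign the correct label to real and generated samples, while the generator G is trained to make D fail.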

The generator and the discriminator can be formed by different NN layers such as Multi-Layer Perceptrons (MLPs) [19], LSTMs [20] or Convolutional Neural Networks (CNNs) [21] [22].

1.2.3 Transformers

Transformers [23] are currently being used in NLP applications due to their good performance, not only in NLP but also in Computer Vision models. Transformers can be used as auto-regressive models, like LSTMs, which allows them to be used in generative tasks. The basic idea behind Transformers is the attention mechanism. There are several variations of the original attention mechanism proposed by Vaswani et al. [23] that have been used in the music composition task [24]. The combination of the attention layer with feed-forward layers leads to the Encoder and Decoder of the Transformer, which differ from purely AutoEncoder models that are also composed of an Encoder and a Decoder. Transformers are trained with tokens, which are structured representations of the inputs. In fig. 1c we show the general Transformer architecture.
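As an illustration of the attention mechanism mentioned above, the following is a minimal NumPy sketch of the scaled dot-product attention of Vaswani et al. [23]; the toy shapes and variable names are our own, and multi-head projections, masking and batching are omitted.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # query-key similarities
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the keys
    return weights @ V                               # weighted sum of the values

# Toy self-attention over a sequence of 4 token embeddings of dimension 8.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(x, x, x)
print(out.shape)  # (4, 8)
```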

1.3 Challenges in Music Composition with Deep Learning

From the challenges perspective, there are different points of view in music composition with DL that make us ask questions about the input representations and DL models that have been used in this field, the output quality of the current state-of-the-art methods, and the way in which researchers have measured the quality of the generated music. In this review, we ask ourselves the following questions, which involve the composition process and its output:


Figure 1: (a) VAE [15], (b) GAN [18] and (c) Transformer general architecture. Reproduced from [23].

Are the current DL models capable of generating music with a certain level of creativity? What is the best NN architecture to perform music composition with DL? Could end-to-end methods generate entire structured music pieces? Are the pieces composed with DL just an imitation of the inputs, or can NNs generate new music in styles that are not present in the training data? Should NNs compose music by following the same logic and process as humans do? How much data do DL models for music generation need? Are current evaluation methods good enough to compare and measure the creativity of the composed music?

To answer these questions, we approach Music Composition or Generation from the point of view of the process followed to obtain the final composition and the output of DL models, i.e., the comparison between the human composition process and the Music Generation process with Deep Learning, and the artistic and creative characteristics of the generated music. We also analyze recent state-of-the-art models of Music Composition with Deep Learning to show the results provided by these models (motifs, complete compositions...). Another important aspect analyzed is the input representation that these models use to generate music, in order to understand whether these representations are suitable for composing music.


This gives us some insights into how these models could be improved, whether these Neural Network architectures are powerful enough to compose new music with a certain level of creativity, and the directions and future work that should be pursued in Music Composition with Deep Learning.

1.4 Paper Structure

In this review, we analyze the Music Composition task from the point of view of the composition process and the type of generated output; we do not cover the performance or synthesis tasks. This paper is structured as follows. Section 2 introduces a general view of the music composition process and basic music principles. In Section 3, we give an overview of state-of-the-art methods from the melodic composition point of view and we describe the DL models that have been tested for composing structured music. In Section 4 we describe DL models that generate multi-track or multi-instrument music, that is, music made for more than one instrument. In Section 5 we show different methods and metrics that are commonly used to evaluate the output of a model for music generation. In Section 6 we discuss the open questions that remain in music generation by analyzing the models described in Sections 3 and 4. Finally, in Section 7 we present future work and challenges that are still open in research.

2 The Music Composition Process

As in written language, the music composition process is a complex process that depends on a large number of decisions [25]. In the music field, this process [26] depends on the music style we are working with. As an example, it is very common in classical music to start with a small unit of one or two bars called a motif and develop it to compose a melody or music phrase, whereas in styles like pop or jazz it is more common to take a chord progression and compose or improvise a melody over it. Regardless of the music style we are composing in, when a composer starts a piece of music there is some basic melodic or harmonic idea behind it. From the classical music point of view, this idea (or motif) is developed by the composer to construct the melody or phrase that follows a certain harmonic progression, and then these phrases are structured in sections. Each section has its own purpose, so it can be written in a different key and its phrases usually follow different harmonic progressions than those of the other sections. Normally, compositions have a melodic part and an accompaniment part. The melodic part of a piece of music can be played by different instruments whose frequency ranges may or may not be similar, and the harmonic part gives the piece a deep and structured feel. The instruments, which are not necessarily in the same frequency range, are combined with Instrumentation and Orchestration techniques (see section 3.2). These elements are crucial in musical composition and are also an important key in defining the style or genre of a piece of music.

Music has two dimensions: time and harmony. The time dimension is represented by the note durations or rhythm, which is the lowest level in this axis. In this dimension, notes can be grouped or measured in units called bars, which are ordered groups of notes. The other dimension, harmony, is related to the note values or pitches. If we think of an image, the time dimension would be the horizontal axis and the harmony dimension the vertical axis. Harmony also has a temporal evolution, but this is not represented in music scores. There is a very common software-based music representation called piano-roll that follows this logic.

The music time dimension is structured in low-level units, the notes, which are grouped in bars that form motifs. At the high level of the time dimension we can find sections, which are composed of phrases that last eight or more bars (this depends on the style and composer). The lowest level of the harmony dimension is the note level, and the superposition of notes, possibly played by different instruments, gives us chords. Sequences of chords are called chord progressions, which are relevant to the composition and also have dependencies in the time dimension. That being said, we can think about music as a complex language model that consists of short- and long-term relationships. These relationships extend in two dimensions: the time dimension, which is related to music structure, and the harmonic dimension, which is related to the notes or pitches and chords, that is, the harmony.
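To make the piano-roll representation concrete, the sketch below builds a binary piano-roll matrix from a toy note list, with time steps on the horizontal axis and pitches on the vertical axis as described above; the (pitch, start, end) note format and the sixteenth-note grid are illustrative assumptions, not a standard.

```python
import numpy as np

# Toy notes as (MIDI pitch, start step, end step) on a sixteenth-note grid (illustrative format).
notes = [(60, 0, 4), (64, 4, 8), (67, 8, 12), (60, 8, 12)]  # C4, E4, then G4 and C4 together

n_pitches, n_steps = 128, 16
piano_roll = np.zeros((n_pitches, n_steps), dtype=np.uint8)
for pitch, start, end in notes:
    piano_roll[pitch, start:end] = 1      # mark the time steps where each note sounds

print(piano_roll.shape)   # (128, 16): pitch (harmony) axis x time axis
print(piano_roll[60])     # activation pattern of middle C across the 16 time steps
```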

From the symbolic music generation and analysis points of view, and based on the ideas of Walton [27], the basic music principles or elements are the following (see fig. 2):

• Harmony. It is the superposition of notes that form chords, which in turn compose a chord progression. The note level can be considered the lowest level in harmony, followed by the chord level. The highest level can be considered the progression level, which usually belongs to a certain key.

• Music Form or Structure. It is the high-level structure that music presents and it is related to the time dimension. The smallest part of a music piece is the motif, which is developed into a music phrase, and the combination of music phrases forms a section. Sections in music are ordered depending on the music style, such as intro-verse-chorus-verse-outro for some pop songs (also represented as ABCBA) or exposition-development-recapitulation (ABA) for Sonatas. The concatenation of sections, which can be in different scales and modes, gives us the entire composition.


• Melody and Texture. Texture in music terms refers to the melodic, rhythmic and harmonic contents that have to be combined in a composition in order to form the music piece. Music can be monophonic or polyphonic depending on the number of notes played at the same time step, and homophonic or heterophonic depending on the melody, that is, whether or not it has accompaniment.

• Instrumentation and Orchestration. These are music techniques that take into account the number of instruments or tracks in a music piece. Whereas Instrumentation is related to the combination of musical instruments that compose a music piece, Orchestration refers to the assignment of melodies and accompaniment to the different instruments that compose a given music piece. In recording or software-based music representation, instruments are organized as tracks. Each track contains the collection of notes played on a single instrument [28]. Therefore, we can call a piece with more than one instrument multi-track, which refers to information that contains two or more tracks where each track is played by a single instrument. Each track can contain one note or multiple notes that sound simultaneously, leading to monophonic and polyphonic tracks, respectively.

Music categories are related to each other. Harmony is related to structure because a section is usually played in the same scale and mode. There are cadences between sections, and there can also be modulations, which change the scale of the piece. Texture and instrumentation are related to timbral features, and their relationship is based on the fact that not all instruments can play the same melodies. An example of that is a melody with lots of ornamentation elements which cannot be played by certain instrument families (because of each instrument's technical possibilities or for stylistic reasons).

Figure 2: (a) General music composition scheme and (b) an example of the beginning of Beethoven's 5th symphony with music levels or categories.

Another important music attribute is dynamics, but dynamics are related to the performance rather than to the composition itself, so we will not cover them in this review. In Fig. 2 we show the aspects of the music composition process that we cover in this review, together with the relationships between categories and the sections of the paper in which each topic is discussed.


3 Melody Generation

A melody is a sequence of notes with a certain rhythm ordered in an aesthetic way. Melodies can be monophonic or polyphonic. Monophonic refers to melodies in which only one note is played at a time step, whereas in polyphonic melodies more than one note is played at the same time step. Melody generation is an important part of music composition and it has been attempted with Algorithmic Composition and with several NN architectures, including Generative Models such as VAEs and GANs, Recurrent Neural Networks (RNNs) used for auto-regressive tasks such as LSTMs, Neural Autoregressive Distribution Estimators (NADEs) [29], and models currently used in Natural Language Processing like Transformers [23]. In Fig. 3 we show the scheme with the basic music principles of an output-like score of a melody generation model.

Figure 3: Scheme of an output-like score of melody generation models.

3.1 Deep Learning Models for Melody Generation: From Motifs to Melodic Phrases

Depending on the music genre of our domain, the human composition process usually begins with the creation of a motif or a chord progression that is then expanded into a phrase or melody. When it comes to DL methods for music generation, several models can generate short note sequences. In 2016, the very first DL models attempted to generate short melodies with Recurrent Neural Networks (RNNs) and semantic models such as Unit Selection [30]. These models worked for short sequences, so the interest in creating entire melodies grew in parallel with the birth of new NNs. Derived from these first works and with the aim of creating longer sequences (or melodies), other models that combined NNs with probabilistic methods came up. Examples of this are Google's Magenta Melody RNN models [31] released in 2016, the Anticipation-RNN [32] and DeepBach [9], both published in 2017. DeepBach is considered one of the current state-of-the-art models for music generation because of its capacity to generate 4-voice chorales in the style of Bach.

However, these methods cannot generate new melodies with a high level of creativity from scratch. In order to improve the generation task, Generative Models were chosen by researchers to perform music composition. In fact, one of the best-performing models nowadays to generate motifs or short melodies from 2 to 16 bars is MusicVAE2 [17], which was published in 2018. MusicVAE is a model for music generation based on a VAE [15]. With this model, music can be generated by interpolating in a latent space. This model is trained with approximately 1.5 million songs from the Lakh MIDI Dataset (LMD)3 [33] and it can generate polyphonic melodies for up to 3 instruments: melody, bass and drums. After the creation of the MusicVAE model, along with the birth of new NN architectures in other fields, the necessity and availability of new DL-based models that could create longer melodies grew. New models based on Transformers came up, such as the Music Transformer [24] in 2018, or models that use pre-trained Transformers such as GPT-2, like MuseNet, proposed by OpenAI in 2019 [34]. These Transformer-based models, such as the Music Transformer, can generate longer melodies and continue a given sequence, but after a few bars or seconds the melody ends up being somewhat random, that is, there are notes and harmonies that do not follow the musical sense of the piece.
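The following is a minimal sketch of the latent-space interpolation idea (plain linear interpolation between two encoded motifs; the actual MusicVAE encoder, decoder and interpolation scheme are described in [17] and may differ):

```python
import numpy as np

def interpolate_latents(z_a, z_b, n_steps=7):
    """Blend two latent codes; each intermediate code can be decoded into a motif."""
    alphas = np.linspace(0.0, 1.0, n_steps)
    return [(1.0 - a) * z_a + a * z_b for a in alphas]

# Hypothetical latent codes of two encoded 2-bar motifs (dimension chosen arbitrarily).
rng = np.random.default_rng(0)
z_a = rng.normal(size=256)
z_b = rng.normal(size=256)

path = interpolate_latents(z_a, z_b)
# Each vector in `path` would be passed through the trained decoder to obtain an intermediate motif.
print(len(path), path[0].shape)   # 7 (256,)
```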

In order to overcome this problem and develop models that can generate longer sequences without losing the sense of the music generated in the previous bars or the main motifs, new models were born in 2020 and 2021 as combinations of VAEs, Transformers and other NNs or Machine Learning algorithms.

2 https://magenta.tensorflow.org/music-vae, accessed August 2021
3 https://colinraffel.com/projects/lmd/, accessed August 2021


Some examples of these models are the TransformerVAE [35] and PianoTree [36]. These models perform well even with polyphonic music and they can generate music phrases. One of the latest models released to generate entire phrases is the one proposed in 2021 by Mittal et al. [37], which is based on Denoising Diffusion Probabilistic Models (DDPMs) [38], new Generative Models that generate high-quality samples by learning to invert a diffusion process from data to Gaussian noise. This model uses a MusicVAE 2-bar model and then trains a Diffusion Model to capture the temporal relationships among the VAE latents zk, with k = 32, that is, 32 latent variables that allow the generation of 64 bars (2 bars per latent). Although longer polyphonic melodies can be generated in this way, they do not follow a central motif, so they tend to lose the sense of a certain direction.

3.2 Structure Awareness

As we mentioned in section 1, music is a structured language. Once melodies have been created, they must be grouped into bigger sections (see fig. 2), which play a fundamental role in a composition. These sections have different names that vary depending on the music style, such as introduction, chorus or verse for pop or trap genres, and exposition, development or recapitulation for classical sonatas. Sections can also be named with capital letters, and song structures can be expressed as ABAB, for example. Generating music with structure is one of the most difficult tasks in music composition with DL because structure implies an aesthetic sense of rhythm, chord progressions and melodies that are concatenated with bridges and cadences [39].

In DL there have been models that have tried to generate structured music by imposing the high-level structure with self-similarity constraints. An example of that is the model proposed by Lattner et al. in 2018 [39], which uses a Convolutional Restricted Boltzmann Machine (C-RBM) to generate music and a self-similarity constraint based on a Self-Similarity Matrix [40] to impose the structure of the piece as if it were a template. This method, which imposes a structure template, is similar to the composition process that a composer follows when composing music, and the resulting music pieces followed the imposed structure template. Although new DL models tend to be end-to-end, and new studies about modeling music with structure are being released [41], there have not been DL models that can generate structured music by themselves, that is, without the help of a template or high-level structure information that is passed to the NN.
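To illustrate the self-similarity idea behind this constraint, the sketch below computes a simple cosine self-similarity matrix from per-bar feature vectors; the feature choice and the toy A-B-A layout are our own illustrative assumptions, not the C-RBM formulation of [39].

```python
import numpy as np

def self_similarity_matrix(features):
    """Cosine similarity between every pair of feature vectors (e.g., one vector per bar)."""
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.maximum(norms, 1e-9)
    return unit @ unit.T                      # entry (i, j): similarity between bar i and bar j

# Toy piece with an A-B-A structure: the two A sections share the same per-bar features.
rng = np.random.default_rng(0)
section_a = rng.normal(size=(4, 12))          # 4 bars, 12 features each
section_b = rng.normal(size=(4, 12))
features = np.vstack([section_a, section_b, section_a])

ssm = self_similarity_matrix(features)
print(ssm.shape)   # (12, 12); repeated sections appear as off-diagonal blocks
```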

3.3 Harmony and Melody Conditioning

Within music composition with DL there is the task of harmonizing a given melody, which differs from the task of creating a polyphonic melody from scratch. On the one hand, if we analyze the harmony of a melody created from scratch with a DL model, we see that music generated with DL is not well structured, as it does not yet compose different sections or write aesthetic cadences or bridges between the sections in an end-to-end way. In spite of that, the harmony generated by Transformer-based models that compose polyphonic melodies is coherent in the first bars of the generated pieces [24] because it follows a certain key. We have to emphasize here that these melodies are written for piano, which differs from multi-instrument music; the latter presents added challenges such as generating appropriate melodies or accompaniments for each instrument or deciding which instruments make up the ensemble (see section 4).

On the other hand, the task of melody harmonization consists of generating the harmony that accompanies a given melody. The accompaniment can be a chord accompaniment, regardless of the instrument or track where the chords are, or a multi-track accompaniment where the notes in each chord belong to a specific instrument. The first models for harmonization used HMMs, but these models were improved by RNNs. Some models predict chord functions [42] and other models match chord accompaniments to a given melody [43]. Regarding the generation of accompaniment with different tracks, GAN-based models that implement lead sheet arrangements have been proposed. In 2018, a Multi-Instrument Co-Arrangement model called MICA [44] and, in 2020, its improvement MSMICA [45] were proposed to generate multi-track accompaniment. There is also a model called the Bach Doodle [46], which uses Coconet [47] to generate accompaniments for a given melody in the style of Bach. The harmonic quality of these models improves on the harmony generated by models that create polyphonic melodies from scratch, because the model focuses on the melodic content to perform the harmonization, which represents a smaller challenge than generating an entire well-harmonized piece from scratch.

There are more tasks within music generation with DL using conditioning, such as generating a melody given a chord progression, which is a way of composing that humans follow. This task has been addressed with Variational AutoEncoders (VAEs) [48], Generative Adversarial Networks or GAN-based models [49], [50]4 and end-to-end models [44]. Other models perform the full process of composing, such as ChordAL [51]. This model generates chords, then the obtained chord progression is sent to a melody generator, and the final output is sent to a music style processor.

4https://shunithaviv.github.io/bebopnet/, accessed August 2021


Models like BebopNet [50] generate a melody from jazz chords because this style presents additional challenges in theharmony context.

3.4 Genre Transformation with Style Transfer

In music, a style or genre is defined as a complex mixture of features ranging from music theory to sound design. These features include the timbre, the composition process, the instruments used in the music piece, and the effects with which the music is synthesized. Due to the fact that there are many music genres and there is a lack of datasets for some of them, it is common to use style transfer techniques to transform music in a given style into another style by changing the pitch of existing notes or adding new instruments that fit the style into which we want to transform the music.

In computer-based music composition, the most common technique for performing style transfer is to obtain an embedding of the style and then use this embedding or feature vector to generate new music. Style transfer [52] in NNs was introduced in 2016 by Gatys et al. with the idea of applying the style features of one image to another image. One of the first studies that used style transfer for symbolic music generation was MIDI-VAE [53] in 2018. MIDI-VAE encodes the style in the latent space as a combination of pitch, dynamics and instrument features to generate polyphonic music. Style transfer can also be achieved with Transfer Learning [14]. The first work that used transfer learning to perform style transfer was a recurrent VAE model for jazz proposed by Hung et al. [54] in 2019. Transfer learning is done by training the model on a source dataset and then fine-tuning the resulting model parameters on a target dataset, which can be in a different style than the source dataset. This model showed that using Transfer Learning to transform a music piece in one style into another is a great solution, because it could not only be used to transform existing pieces into new genres but could also be used to compose music from scratch in genres that are not present in the music composition datasets used nowadays. An example of this could be using a NN trained with a large dataset for pop, such as the Lakh MIDI Dataset (LMD) [33], as a pre-trained model to generate urban music through Transfer Learning.
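A minimal PyTorch-style sketch of this recipe, assuming a toy autoregressive melody model and random placeholder batches (none of this corresponds to the cited models or datasets): pre-train on the large source-style data, optionally freeze low-level layers, and fine-tune on the smaller target-style data.

```python
import torch
import torch.nn as nn

class TinyMelodyLM(nn.Module):
    """Toy autoregressive melody model over a 128-token pitch vocabulary."""
    def __init__(self, vocab=128, hidden=64):
        super().__init__()
        self.embed = nn.Embedding(vocab, hidden)
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, vocab)

    def forward(self, tokens):
        h, _ = self.lstm(self.embed(tokens))
        return self.head(h)

def train(model, batches, lr, epochs=1):
    opt = torch.optim.Adam((p for p in model.parameters() if p.requires_grad), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in batches:
            opt.zero_grad()
            logits = model(x)
            loss = loss_fn(logits.reshape(-1, logits.size(-1)), y.reshape(-1))
            loss.backward()
            opt.step()

model = TinyMelodyLM()
source_batches = [(torch.randint(0, 128, (8, 32)), torch.randint(0, 128, (8, 32)))]  # large source style
target_batches = [(torch.randint(0, 128, (2, 32)), torch.randint(0, 128, (2, 32)))]  # small target style

train(model, source_batches, lr=1e-3)       # 1) pre-train on the source dataset
for p in model.embed.parameters():          # 2) optionally freeze low-level layers
    p.requires_grad = False
train(model, target_batches, lr=1e-4)       # 3) fine-tune on the target-style dataset
```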

Other music features such as harmony and texture (see fig. 2) have also been used as style transfer features [55], [36], [56]. Fusion genre models, in which different styles are mixed to generate music in an unknown style, have also been studied [57].

4 Instrumentation and Orchestration

As we mentioned in section 2, Instrumentation and Orchestration are fundamental elements of the musical genre being composed, and they may represent a characteristic signature of each composer through the use of specific instruments or the way their compositions are orchestrated. An example of that is the orchestration that Beethoven used in his symphonies, which changed the way in which music was composed [58]. Instrumentation is the study of how to combine similar or different instruments in varying numbers in order to create an ensemble, whereas Orchestration is the selection and combination of similarly or differently scored sections [59]. From that, we can relate Instrumentation to the color of a piece and Orchestration to the aesthetic aspect of the composition. Instrumentation and Orchestration have a huge impact on the way we perceive music and thus on the emotional part of music but, although they represent a fundamental part of the music, emotions are beyond the scope of this work.

4.1 From Polyphonic to Multi-Instrument Music Generation

In computer-based music composition we can group the Instrumentation and Orchestration concepts into multi-instrument or multi-track music. However, the DL-based models for multi-instrument generation do not work exactly with those concepts. Multi-instrument DL-based models generate polyphonic music for more than one instrument but, does the generated music follow a coherent harmonic progression? Is the resulting arrangement coherent in terms of Instrumentation and Orchestration, or do DL-based models just generate multi-instrument music without taking into account the color of each instrument or the arrangement? In section 3 we showed that polyphonic music generation models can compose music with a certain harmonic sense, but when facing multi-instrument music one of the most important aspects to take into account is the color of the instruments and the ensemble. Deciding how many and which instruments are in the ensemble, and how to divide the melody and accompaniment between them, is not yet a solved problem in Music Generation with DL. In recent years these challenges have been faced by building DL models that generate music from scratch, which can be interactive models in which humans select the instruments of the ensemble [28]. There are also models that allow inpainting instruments or bars. We describe these models and answer the questions posed above in section 4.2. In Fig. 4 we show the scheme with the basic music principles of an output-like score of a multi-instrument generation model.


Figure 4: Scheme of an output-like score of multi-instrument generation models.

4.2 Multi-Instrument Generation from Scratch

The first models that could generate multi-track music have been proposed recently. Prior to multi-track music generation, some models generated the drums track for a given melody or chords. An example of those models is the one proposed in 2012 by Kang et al. [60], which accompanied a melody in a given scale with an automated drums generator. Later on, in 2017, Chu et al. [61] used a hierarchical RNN to generate pop music in which drums were present.

Some of the most used architectures in music generation are generative models such as GANs and VAEs. The first and most well-known model for multi-track music generation is MuseGAN [62], presented in 2017. Then, more models followed the multi-instrument generation task [63], [64], and later, in 2020, other models based on AutoEncoders such as MusAE [65] were released. The other big group of NN architectures that have been used recently to generate music are the Transformers. The most well-known model for music generation with Transformers is the Music Transformer [24] for polyphonic piano music generation. In 2019, Donahue et al. [66] proposed LakhNES for multi-track music generation and, in 2020, Ens et al. proposed a Conditional Multi-Track Music Generation model (MMM) [28], which is based on LakhNES and improves the token representation of this previous model by concatenating multiple tracks into a single sequence. This model uses a MultiInstrument and a BarFill representation, which are shown in Fig. 5: the MultiInstrument representation contains the tokens that the MMM model uses for generating music, and the BarFill representation is used to inpaint, that is, to generate a bar or a couple of bars while maintaining the rest of the composition.
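As a rough, hypothetical illustration of the idea of concatenating multiple tracks into a single token sequence (the actual MMM vocabulary, token names and bar-filling tokens are defined in [28] and differ from the ones invented here):

```python
# Hypothetical token scheme, for illustration only: each track is serialized into
# track/bar/note tokens and the tracks are concatenated into one flat sequence.
def track_to_tokens(instrument, bars):
    tokens = ["TRACK_START", f"INST={instrument}"]
    for bar in bars:
        tokens.append("BAR_START")
        for pitch, duration in bar:                      # toy (pitch, duration) note events
            tokens += [f"NOTE_ON={pitch}", f"DUR={duration}"]
        tokens.append("BAR_END")
    tokens.append("TRACK_END")
    return tokens

piano = [[(60, 4), (64, 4)], [(67, 8)]]                  # two toy bars of a piano track
bass = [[(36, 8)], [(43, 8)]]                            # two toy bars of a bass track

sequence = ["PIECE_START"]
for instrument, bars in [("piano", piano), ("bass", bass)]:
    sequence += track_to_tokens(instrument, bars)
print(sequence[:8])
```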

From the composition process point of view, these models do not orchestrate or instrumentate; they create music from scratch or by inpainting. This means that these models do not choose the number of instruments and do not generate high-quality melodic or accompaniment content related to the instrument that is being selected. As an example, the MMM model generates melodic content for a predefined instrument that follows the timbral features of the instrument, but when inpainting or recreating a single instrument while keeping the other tracks, it is sometimes difficult to follow the key in which the other instruments are composed. This leads us to the conclusion that multi-instrument models for music generation focus on end-to-end generation, but still do not work well when it comes to instrumentation or orchestration, as they still cannot decide the number of instruments in the generated piece of music. They generate music for the ensemble they were trained on, such as LakhNES [66], or they take predefined tracks and generate the content of each track [28]. More recent models, such as MMM, are opening up interactivity between human and artificial intelligence in terms of multi-instrument generation, which will allow better tracking of the human compositional process and thus improve the music generated with multiple instruments.


Figure 5: MMM token representations reproduced from [28]

5 Evaluation and Metrics

Evaluation in music generation can be divided according to the way in which the output of a DL model is measured. Ji et al. [67] differentiate between evaluation from an objective point of view and from a subjective point of view. In music, it is necessary to measure the results from a subjective point of view because it is the type of evaluation that tells us how much creativity the model brings compared to human creativity. Sometimes, the objective evaluation that computes metrics on the results of a model can give us an idea of the quality of these results, but it is difficult to find a way to relate it to the concept of creativity. In this section we show how the state-of-the-art models measure the quality of their results from objective and subjective points of view.

5.1 Objective Evaluation

The objective evaluation measures the performance of the model and the quality of its outputs using numerical metrics. In music generation there is the problem of comparing models trained for different purposes and with different datasets, so we give a description of the most common metrics used in state-of-the-art models. Ji et al. [67] differentiate between model metrics and music metrics or descriptive statistics, and other methods such as pattern repetition [68] or plagiarism detection [9].

When we want to measure the performance of a model, the most used metrics, depending on the DL model used to generate music, are: the loss, the perplexity, the BLEU score, the precision (P), the recall (R) and the F-score (F1). Normally, these metrics are used to compare different DL models that are built for the same purpose.

The loss is usually used to show the difference between the inputs and outputs of the model from a mathematical point of view, while the perplexity tells us the generalization capability of a model, which is more related to how the model generates new music. As an example, the Music Transformer [24] uses the loss and the perplexity to compare the outputs of different Transformer architectures in order to validate the model, TonicNet [69] only uses the loss for the same purpose, and MusicVAE [17] only uses a measure of the reconstruction quality of the model, but does not use any metric to compare against other DL music generation models.
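As a concrete note on the relationship between these two metrics: for a token-level model, the perplexity is the exponential of the average negative log-likelihood (cross-entropy) per token, as in the following small sketch with made-up values.

```python
import numpy as np

def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token), in nats."""
    return float(np.exp(np.mean(token_nlls)))

# Made-up per-token negative log-likelihoods of a model on a validation set.
nlls = [2.1, 1.7, 2.4, 1.9, 2.0]
print(perplexity(nlls))   # lower perplexity = the model assigns higher probability to the data
```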

Regarding metrics that are specifically related to music, that is, those that take into account musical descriptors, these metrics help to measure the quality of a composition. According to Ji et al. [67], they can be grouped into four categories: pitch-related, rhythm-related, harmony-related and style-transfer-related. Pitch-related metrics [67], such as scale consistency, tone span, ratio of empty bars or number of pitch classes used, measure pitch attributes in general. Rhythm-related metrics take into account the duration or pattern of the notes and include, for example, rhythm variations, the number of concurrent three or four notes, or the duration of repeated pitches. The harmony-related metrics measure the chord entropy, distance or coverage. These three metric categories are used by models like MuseGAN [62], C-RNN-GAN [70] or JazzGAN [49]. Finally, techniques related to style transfer help to understand how close or far the generation is from the desired style. These include, among others, the style fit, the content preservation or the strength of transfer [71].
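The exact definitions of these descriptors vary between papers; as an illustration, the sketch below computes two simple pitch-related metrics in the spirit of those listed above: the number of pitch classes used and a naive scale-consistency ratio.

```python
import numpy as np

C_MAJOR = {0, 2, 4, 5, 7, 9, 11}   # pitch classes of the C major scale

def num_pitch_classes(pitches):
    """Number of distinct pitch classes used in a generated excerpt."""
    return len({p % 12 for p in pitches})

def scale_consistency(pitches, scale=C_MAJOR):
    """Largest fraction of notes that fit a single transposition of `scale`."""
    best = 0.0
    for shift in range(12):
        shifted = {(s + shift) % 12 for s in scale}
        ratio = float(np.mean([p % 12 in shifted for p in pitches]))
        best = max(best, ratio)
    return best

melody = [60, 62, 64, 65, 67, 69, 71, 72, 61]   # mostly C major plus one out-of-scale note
print(num_pitch_classes(melody))                 # 8
print(round(scale_consistency(melody), 2))       # 0.89
```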

5.2 Subjective Evaluation

The subjective point of view determines the quality of the generated music in terms of creativity and novelty, that is, to what extent the generated music can be considered art. There is no single way to define art, although art involves creativity and aesthetics.


Sternberg and Kaufman [72] defined creativity as the ability to make contributions that are both novel and appropriate to the task, often with an added component such as being of high quality, surprising, or useful. Creativity requires a deeper understanding of the nature and use of musical knowledge. According to Ji et al. [67], there is a lack of correlation between the quantitative evaluation of music quality and human judgement, which means that music generation models must also be evaluated from a subjective point of view, which gives us insights into the creativity of the model. The most used method in subjective evaluations is the listening test, which often consists of humans trying to differentiate between machine-generated and human-created music. This method is known as the Turing test and was performed to evaluate DeepBach [9]. In that case, 1,272 people belonging to different musical experience groups took the test, which showed that the more complex the model was, the better outputs it got. MusicVAE [17] also performed a listening test and the Kruskal-Wallis H-test to validate the quality of the model, concluding that the model performed better with the hierarchical decoder. MuseGAN [62] also conducted a listening test with 144 users divided into groups with different musical experience, but there were pre-defined questions that the users had to rate on a scale from 1 to 5: the harmony complaisance, the rhythm, the structure, the coherence and the overall rating.

Other listening methods require scoring the generated music; this is known as side-by-side rating [67]. It is also possible to ask the listeners some questions about the creativity of the model or the naturalness of the generated piece, among others, depending on the generation goal of the model. One important thing to keep in mind in listening tests is the variability of the population chosen for the test (whether the listeners are music students with a basic knowledge of music theory, amateurs without any music knowledge, or professional musicians). The listeners must have the same stimuli, listen to the same pieces and have as reference (if applicable) the same human-created pieces. Auditory fatigue must also be taken into account, as a bias can be induced in the listeners if they listen to similar samples for a long period of time.

That being said, we can conclude that listening tests are indispensable when it comes to music generation because they give feedback on the quality of the model, and they can also be a way to find the best NN architecture or DL model among those being studied.

6 Discussion

We have shown that music is a structured language with temporal and harmonic short- and long-term relationships, which requires a deep understanding of all of its insights in order to be modelled. This, in addition to the variety of genres and subgenres that exist in music and the large number of composing strategies that can be followed to compose a music piece, makes music generation with DL a constantly growing and challenging field. Having described the music composition process and recent work in DL for music generation, we now address the issues raised in section 1.3.

Are the current DL models capable of generating music with a certain level of creativity?

The first models that generated music with DL used RNNs such as LSTMs. These models could generate notes, but they failed when generating long-term sequences. This was due to the fact that these NNs did not handle the long-term dependencies that are required for music generation. In order to solve this problem and be able to generate short motifs by interpolating two existing motifs or sampling from a distribution, MusicVAE was created. But some questions arise here: Do interpolations between existing motifs generate high-quality ones which make sense inside the same music piece? If we use MusicVAE to create a short motif we can get very good results, but if we use this kind of model to generate longer phrases or motifs that are similar to the inputs, these interpolations may output motifs that can be aesthetic but sometimes do not follow any rhythmic pattern or note direction (ascending or descending) present in the inputs. Therefore, these interpolations usually cannot generate high-quality motifs because the model does not understand the rhythmic patterns and note directions. In addition, chord progressions normally have inversions, and there are rules in classical music or stylistic restrictions in pop, jazz or urban music that determine how each chord is followed by another chord. If we analyze the polyphonic melodies generated by DL methods, there is a lack of quality in terms of harmonic content, either because the NNs trained to generate music cannot understand all these intricacies that are present in the music language, or because this information should be passed to the NN as part of the input, for example as tokens.

What is the best NN architecture to perform music composition with DL?

The Transformer architecture has been used with different attention mechanisms that allow longer sequences to be modeled. An example of this is the success of the MMM model [28], which used GPT-2 to generate multi-track music. Despite the fact that the model used a Transformer pre-trained for text generation, it generates coherent music in terms of harmony and rhythm. Other architectures use Generative Networks such as GANs or VAEs, and also combinations of these architectures with Transformers.


The power that these models bring is the possibility of extracting high-level music attributes, such as the style, and low-level features that are organized in a latent space. This latent space is then used to interpolate between those features and attributes in order to generate new music based on existing pieces and music styles.

Analyzing the NN models and architectures that have been used in the past years to generate music with DL, there is not a specific NN architecture that performs better for this purpose, because the best NN architecture for building a music generation model will depend on the output that we want to obtain. In spite of that, Transformers and Generative models are emerging as the best alternatives at this moment, as the latest works in this domain demonstrate. A combination of both models is also a great option to perform music generation [35], although it depends on the output we want to generate, and sometimes the best solution comes from a combination of DL with probabilistic methods. Another aspect to take into account is that, generally, music generation requires models with a large number of parameters and a lot of data. We can address this problem by taking a pre-trained model, as some of the state-of-the-art models described in the previous sections do, and then fine-tuning it in another NN architecture. Another option is having a pre-trained latent space, generated by training a big model with a huge dataset as MusicVAE proposes, and then training a smaller NN with less data, taking advantage of the pre-trained latent space in order to condition the style of the music compositions, as MidiMe proposes [73].

Could end-to-end methods generate entire structured music pieces?

As we described in section 3.2, nowadays there are structure-template-based models that can generate structured music [39], but there is not yet an end-to-end method that can compose a structured music piece. The music composition process that a human composer follows is similar to this template-based method. In the near future, it is likely that AI could compose structured music from scratch, but the question here is whether AI models for music generation will be used to compose entire music pieces from scratch or whether these models will be more useful as an aid to composers and thus as an interaction between humans and AI.

Are the pieces composed with DL just an imitation of the inputs, or can NNs generate new music in styles that are not present in the training data?

When training DL models, some information in the input that is passed to the NN can be present without any modification in the output. Even so, MusicVAE [17] and other DL models for music generation showed that new music can be composed without imitating existing music or committing plagiarism. Imitating the inputs would be a case of overfitting, which is never the goal of a DL model. It should also be taken into account that it is very difficult to commit plagiarism in music generation due to the great variety of instruments, tones, rhythms or chords that may be present in a piece of music.

Should NNs compose music by following the same logic and process as humans do?

We showed that researchers started by building models that could generate polyphonic melodies, but these melodies did not follow any direction after a few bars. When MusicVAE came out, it was possible to generate high-quality motifs, and this encouraged new studies to generate melodies by taking information from past time steps. New models such as Diffusion Models [37] are using these pre-trained models to generate longer sequences so that melodies follow patterns or directions. We also showed that there are models that can generate a melody by conditioning it with chord progressions, which is the way of composing music in styles like pop. Comparing the human way of composing with the DL architectures that are being used to generate music, we can see some similarities between both processes, especially in auto-regressive models. Auto-regression (AR) consists of predicting future values from past events. Some DL methods are auto-regressive, and the fact that new models are trying to generate longer sequences by taking information from past time steps resembles the human process of composing classical music.

How much data do DL models for music generation need?

This question can be partially answered if we look at the state-of-the-art models. MusicVAE uses the LMD [33], with 3.7 million melodies, 4.6 million drum patterns and 116 thousand trios. The Music Transformer instead uses only 1,100 piano pieces from the Piano-e-Competition to train the model. Other models, such as the MMM, take GPT-2, which is a Transformer pre-trained with lots of text data. This leads us to affirm that DL models for music generation do need lots of data, especially when training Generative Models or Transformers, but taking pre-trained models and performing transfer learning is also a good solution, especially for music genres and subgenres that are not represented in the current datasets for symbolic music generation.


Are current evaluation methods good enough to compare and measure the creativity of the composed music?

As we described in section 5, there are two evaluation categories: the objective and the subjective evaluation. Objective evaluation metrics are similar between existing methods, but there is a lack of a general subjective evaluation method. Listening tests are the most used subjective evaluation method, but sometimes the Turing test, which just asks to distinguish between a computer-based and a human composition, is not enough to know all the characteristics of the compositions created by a NN. The solution to this problem would be to ask general questions related to the quality of the music features shown in fig. 2, as MuseGAN proposes, and to use the same questions and the same rating method across DL models in order to establish a general subjective evaluation method.

7 Conclusions and Future Work

In this paper, we have described the state of the art in DL-based music generation by giving an overview of the NN architectures that have been used for this task, and we have discussed the challenges that remain open in the use of deep NNs for music generation.

The use of DL architectures and techniques for the generation of music (as well as other artistic content) is a growing area of research. However, there are open challenges such as generating music with structure, analyzing the creativity of the generated music and building interactive models that could help composers. Future work should focus on better modeling the long-term relationships (in the time and harmony axes) in order to generate well-structured and harmonized music that does not lose direction after a few bars, and on inpainting and human-AI interaction, which is a task of growing interest in recent years. There is also a pending challenge related to transfer learning or the conditioning of generation on styles, which would allow not being restricted to the same authors and genres that are present in the publicly available datasets, such as the JSB Chorales Dataset or the Lakh MIDI Dataset; this restriction makes most state-of-the-art works focus only on the same music styles. When it comes to multi-instrument generation, this task does not follow the human composing process, and it would be interesting to see new DL models that first compose high-quality melodic content and then decide, by themselves or with human help, the number of instruments of the music piece, being able to write high-quality music for each instrument attending to its timbral features. Further questions related to the directions that music generation with DL should take, that is, building end-to-end models that can generate highly creative music from scratch or interactive models in which composers could interact with the AI, are matters that the future will settle, although the trend of human-AI interaction is growing faster every day.

There are more open questions in music composition with DL that are beyond the scope of this paper, such as who owns the intellectual property of music generated with DL when the NNs are trained on copyrighted music. We suggest that this will be an important issue for commercial applications. The central point here is to define what makes one composition different from another, and several features play an important role. As mentioned in section 1, these features include the composition itself, but also the timbre and the effects used in creating the sound of the instruments. From the point of view of composition, which is the scope of our study, we can state that when generating music with DL the output is always likely to be similar to the inputs, and sometimes the generated music contains patterns taken directly from the training data, so further research is needed in this area, from the perspectives of music theory, intellectual property and science, to define what makes a composition different from others and how music generated with DL could be registered.

We hope that the analysis presented in this article will help to better understand the problems and possible solutions and thus may contribute to the overall research agenda of deep learning-based music generation.

Acknowledgments.

This research has been partially supported by the Spanish Ministry of Science, Innovation and Universities under contract RTI2018-096986-B-C31 and by the Aragonese Government through the AffectiveLab-T60-20R project.

We wish to thank Jürgen Schmidhuber for his suggestions.

References

[1] https://www.copyright.gov/prereg/music.html, 2019. accessed July 2021.

[2] Noam Chomsky. Syntactic structures. De Gruyter Mouton, 2009.

[3] David Cope. Experiments in musical intelligence (emi): Non-linear linguistic-based composition. Journal of New Music Research, 18(1-2):117–139, 1989.


[4] Iannis Xenakis. Musiques formelles nouveaux principes formels de composition musicale. 1981.

[5] https://koenigproject.nl/project-1/, 2019. accessed July 2021.

[6] Peter M. Todd. A connectionist approach to algorithmic composition. Computer Music Journal, 13(4):27–43, 1989.

[7] Gerhard Nierhaus. Algorithmic composition: paradigms of automated music generation. Springer Science & Business Media, 2009.

[8] Lejaren A Hiller Jr and Leonard M Isaacson. Musical composition with a high speed digital computer. In Audio Engineering Society Convention 9. Audio Engineering Society, 1957.

[9] Gaëtan Hadjeres, François Pachet, and Frank Nielsen. Deepbach: a steerable model for bach chorales generation. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, volume 70 of Proceedings of Machine Learning Research, pages 1362–1371. PMLR, 2017.

[10] Michael C. Mozer. Neural network music composition by prediction: Exploring the benefits of psychoacoustic constraints and multi-scale processing. Connect. Sci., 6(2-3):247–280, 1994.

[11] Douglas Eck. A network of relaxation oscillators that finds downbeats in rhythms. In Georg Dorffner, Horst Bischof, and Kurt Hornik, editors, Artificial Neural Networks - ICANN 2001, International Conference, Vienna, Austria, August 21-25, 2001, Proceedings, volume 2130 of Lecture Notes in Computer Science, pages 1239–1247. Springer, 2001.

[12] Douglas Eck and Jürgen Schmidhuber. Learning the long-term structure of the blues. In José R. Dorronsoro, editor, Artificial Neural Networks - ICANN 2002, International Conference, Madrid, Spain, August 28-30, 2002, Proceedings, volume 2415 of Lecture Notes in Computer Science, pages 284–289. Springer, 2002.

[13] Jamshed J Bharucha and Peter M Todd. Modeling the perception of tonal structure with neural nets. Computer Music Journal, 13(4):44–53, 1989.

[14] Fuzhen Zhuang, Zhiyuan Qi, Keyu Duan, Dongbo Xi, Yongchun Zhu, Hengshu Zhu, Hui Xiong, and Qing He. A comprehensive survey on transfer learning. Proc. IEEE, 109(1):43–76, 2021.

[15] Diederik P. Kingma and Max Welling. Auto-encoding variational bayes. In Yoshua Bengio and Yann LeCun, editors, 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings, 2014.

[16] Irina Higgins, Loïc Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-vae: Learning basic visual concepts with a constrained variational framework. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net, 2017.

[17] Adam Roberts, Jesse H. Engel, Colin Raffel, Curtis Hawthorne, and Douglas Eck. A hierarchical latent vector model for learning long-term structure in music. In Jennifer G. Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, volume 80 of Proceedings of Machine Learning Research, pages 4361–4370. PMLR, 2018.

[18] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. NIPS'14, pages 2672–2680, Cambridge, MA, USA, 2014. MIT Press.

[19] Frank Rosenblatt. The perceptron: a probabilistic model for information storage and organization in the brain. Psychological review, 65(6):386, 1958.

[20] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.

[21] Kunihiko Fukushima and Sei Miyake. Neocognitron: A new algorithm for pattern recognition tolerant of deformations and shifts in position. Pattern Recognit., 15(6):455–469, 1982.

[22] Yann LeCun, Patrick Haffner, Léon Bottou, and Yoshua Bengio. Object recognition with gradient-based learning. In David A. Forsyth, Joseph L. Mundy, Vito Di Gesù, and Roberto Cipolla, editors, Shape, Contour and Grouping in Computer Vision, volume 1681 of Lecture Notes in Computer Science, page 319. Springer, 1999.

[23] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett, editors, Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pages 5998–6008, 2017.


[24] Cheng-Zhi Anna Huang, Ashish Vaswani, Jakob Uszkoreit, Noam Shazeer, Curtis Hawthorne, Andrew M Dai, Matthew D Hoffman, and Douglas Eck. Music transformer: Generating music with long-term structure. arXiv preprint arXiv:1809.04281, 2018.

[25] Raymond Glenn Levi. A field investigation of the composing processes used by second-grade children creating original language and music pieces. PhD thesis, Case Western Reserve University, 1991.

[26] David Collins. A synthesis process model of creative thinking in music composition. Psychology of music, 33(2):193–216, 2005.

[27] Charles W Walton. Basic Forms in Music. Alfred Music, 2005.

[28] Jeff Ens and Philippe Pasquier. MMM: Exploring conditional multi-track music generation with the transformer. arXiv preprint arXiv:2008.06048, 2020.

[29] Hugo Larochelle and Iain Murray. The neural autoregressive distribution estimator. In Geoffrey J. Gordon, David B. Dunson, and Miroslav Dudík, editors, Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2011, Fort Lauderdale, USA, April 11-13, 2011, volume 15 of JMLR Proceedings, pages 29–37. JMLR.org, 2011.

[30] Mason Bretan, Gil Weinberg, and Larry P. Heck. A unit selection methodology for music generation using deep neural networks. In Ashok K. Goel, Anna Jordanous, and Alison Pease, editors, Proceedings of the Eighth International Conference on Computational Creativity, ICCC 2017, Atlanta, Georgia, USA, June 19-23, 2017, pages 72–79. Association for Computational Creativity (ACC), 2017.

[31] Elliot Waite et al. Generating long-term structure in songs and stories. Web blog post. Magenta, 15(4), 2016.

[32] Gaëtan Hadjeres and Frank Nielsen. Interactive music generation with positional constraints using anticipation-rnns. CoRR, abs/1709.06404, 2017.

[33] Colin Raffel. Learning-based methods for comparing sequences, with applications to audio-to-midi alignment and matching. PhD thesis, Columbia University, 2016.

[34] Christine Payne. Musenet, 2019. URL https://openai.com/blog/musenet.

[35] Junyan Jiang, Gus Xia, Dave B. Carlton, Chris N. Anderson, and Ryan H. Miyakawa. Transformer VAE: A hierarchical model for structure-aware and interpretable music representation learning. In 2020 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2020, Barcelona, Spain, May 4-8, 2020, pages 516–520. IEEE, 2020.

[36] Ziyu Wang, Yiyi Zhang, Yixiao Zhang, Junyan Jiang, Ruihan Yang, Junbo Zhao, and Gus Xia. PIANOTREE VAE: structured representation learning for polyphonic music. CoRR, abs/2008.07118, 2020.

[37] Gautam Mittal, Jesse H. Engel, Curtis Hawthorne, and Ian Simon. Symbolic music generation with diffusion models. CoRR, abs/2103.16091, 2021.

[38] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In Hugo Larochelle, Marc'Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin, editors, Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020.

[39] Stefan Lattner, Maarten Grachten, and Gerhard Widmer. Imposing higher-level structure in polyphonic music generation using convolutional restricted boltzmann machines and constraints. CoRR, abs/1612.04742, 2016.

[40] Meinard Müller. Fundamentals of Music Processing - Audio, Analysis, Algorithms, Applications. Springer, 2015.

[41] Ke Chen, Weilin Zhang, Shlomo Dubnov, Gus Xia, and Wei Li. The effect of explicit structure encoding of deep neural networks for symbolic music generation. In 2019 International Workshop on Multilayer Music Representation and Processing (MMRP), pages 77–84. IEEE, 2019.

[42] Yin-Cheng Yeh, Wen-Yi Hsiao, Satoru Fukayama, Tetsuro Kitahara, Benjamin Genchel, Hao-Min Liu, Hao-Wen Dong, Yian Chen, Terence Leong, and Yi-Hsuan Yang. Automatic melody harmonization with triad chords: A comparative study. CoRR, abs/2001.02360, 2020.

[43] Wei Yang, Ping Sun, Yi Zhang, and Ying Zhang. Clstms: A combination of two lstm models to generate chords accompaniment for symbolic melody. In 2019 International Conference on High Performance Big Data and Intelligent Systems (HPBD&IS), pages 176–180. IEEE, 2019.

[44] Hongyuan Zhu, Qi Liu, Nicholas Jing Yuan, Chuan Qin, Jiawei Li, Kun Zhang, Guang Zhou, Furu Wei, Yuanchun Xu, and Enhong Chen. Xiaoice band: A melody and arrangement generation framework for pop music. In Yike Guo and Faisal Farooq, editors, Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD 2018, London, UK, August 19-23, 2018, pages 2837–2846. ACM, 2018.


[45] Hongyuan Zhu, Qi Liu, Nicholas Jing Yuan, Kun Zhang, Guang Zhou, and Enhong Chen. Pop music generation: From melody to multi-style arrangement. ACM Trans. Knowl. Discov. Data, 14(5):54:1–54:31, 2020.

[46] Cheng-Zhi Anna Huang, Curtis Hawthorne, Adam Roberts, Monica Dinculescu, James Wexler, Leon Hong, and Jacob Howcroft. Approachable music composition with machine learning at scale. In Arthur Flexer, Geoffroy Peeters, Julián Urbano, and Anja Volk, editors, Proceedings of the 20th International Society for Music Information Retrieval Conference, ISMIR 2019, Delft, The Netherlands, November 4-8, 2019, pages 793–800, 2019.

[47] Cheng-Zhi Anna Huang, Tim Cooijmans, Adam Roberts, Aaron C. Courville, and Douglas Eck. Counterpoint by convolution. In Sally Jo Cunningham, Zhiyao Duan, Xiao Hu, and Douglas Turnbull, editors, Proceedings of the 18th International Society for Music Information Retrieval Conference, ISMIR 2017, Suzhou, China, October 23-27, 2017, pages 211–218, 2017.

[48] Yifei Teng, Anny Zhao, and Camille Goudeseune. Generating nontrivial melodies for music as a service. In Sally Jo Cunningham, Zhiyao Duan, Xiao Hu, and Douglas Turnbull, editors, Proceedings of the 18th International Society for Music Information Retrieval Conference, ISMIR 2017, Suzhou, China, October 23-27, 2017, pages 657–663, 2017.

[49] Nicholas Trieu and R Keller. Jazzgan: Improvising with generative adversarial networks. In MUME workshop, 2018.

[50] Shunit Haviv Hakimi, Nadav Bhonker, and Ran El-Yaniv. Bebopnet: Deep neural models for personalized jazz improvisations. In Proceedings of the 21st International Society for Music Information Retrieval Conference, ISMIR, 2020.

[51] Hao Hao Tan. Chordal: A chord-based approach for music generation using bi-lstms. In Kazjon Grace, Michael Cook, Dan Ventura, and Mary Lou Maher, editors, Proceedings of the Tenth International Conference on Computational Creativity, ICCC 2019, Charlotte, North Carolina, USA, June 17-21, 2019, pages 364–365. Association for Computational Creativity (ACC), 2019.

[52] Leon A. Gatys, Alexander S. Ecker, and Matthias Bethge. Image style transfer using convolutional neural networks. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 2414–2423. IEEE Computer Society, 2016.

[53] Gino Brunner, Andres Konrad, Yuyi Wang, and Roger Wattenhofer. MIDI-VAE: modeling dynamics and instrumentation of music with applications to style transfer. In Emilia Gómez, Xiao Hu, Eric Humphrey, and Emmanouil Benetos, editors, Proceedings of the 19th International Society for Music Information Retrieval Conference, ISMIR 2018, Paris, France, September 23-27, 2018, pages 747–754, 2018.

[54] Hsiao-Tzu Hung, Chung-Yang Wang, Yi-Hsuan Yang, and Hsin-Min Wang. Improving automatic jazz melody generation by transfer learning techniques. In 2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2019, Lanzhou, China, November 18-21, 2019, pages 339–346. IEEE, 2019.

[55] Ziyu Wang, Dingsu Wang, Yixiao Zhang, and Gus Xia. Learning interpretable representation for controllable polyphonic music generation. CoRR, abs/2008.07122, 2020.

[56] Shih-Lun Wu and Yi-Hsuan Yang. Musemorphose: Full-song and fine-grained music style transfer with just one transformer VAE. CoRR, abs/2105.04090, 2021.

[57] Zhiqian Chen, Chih-Wei Wu, Yen-Cheng Lu, Alexander Lerch, and Chang-Tien Lu. Learning to fuse music genres with generative adversarial dual learning. In Vijay Raghavan, Srinivas Aluru, George Karypis, Lucio Miele, and Xindong Wu, editors, 2017 IEEE International Conference on Data Mining, ICDM 2017, New Orleans, LA, USA, November 18-21, 2017, pages 817–822. IEEE Computer Society, 2017.

[58] George Grove. Beethoven and his nine symphonies, volume 334. Courier Corporation, 1962.

[59] Ertugrul Sevsay. The Cambridge guide to orchestration. Cambridge University Press, 2013.

[60] Semin Kang, Soo-Yol Ok, and Young-Min Kang. Automatic music generation and machine learning based evaluation. In International Conference on Multimedia and Signal Processing, pages 436–443, 2012.

[61] Hang Chu, Raquel Urtasun, and Sanja Fidler. Song from PI: A musically plausible network for pop music generation. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Workshop Track Proceedings. OpenReview.net, 2017.

[62] Hao-Wen Dong, Wen-Yi Hsiao, Li-Chia Yang, and Yi-Hsuan Yang. Musegan: Multi-track sequential generative adversarial networks for symbolic music generation and accompaniment. In Sheila A. McIlraith and Kilian Q. Weinberger, editors, Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, pages 34–41. AAAI Press, 2018.

[63] Lantao Yu, Weinan Zhang, Jun Wang, and Yong Yu. Seqgan: Sequence generative adversarial nets with policy gradient. In Satinder P. Singh and Shaul Markovitch, editors, Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, February 4-9, 2017, San Francisco, California, USA, pages 2852–2858. AAAI Press, 2017.

[64] Hao-Wen Dong and Yi-Hsuan Yang. Convolutional generative adversarial networks with binary neurons for polyphonic music generation. In Emilia Gómez, Xiao Hu, Eric Humphrey, and Emmanouil Benetos, editors, Proceedings of the 19th International Society for Music Information Retrieval Conference, ISMIR 2018, Paris, France, September 23-27, 2018, pages 190–196, 2018.

[65] Andrea Valenti, Antonio Carta, and Davide Bacciu. Learning a latent space of style-aware symbolic music representations by adversarial autoencoders. CoRR, abs/2001.05494, 2020.

[66] Chris Donahue, Huanru Henry Mao, Yiting Ethan Li, Garrison W. Cottrell, and Julian J. McAuley. Lakhnes: Improving multi-instrumental music generation with cross-domain pre-training. In Arthur Flexer, Geoffroy Peeters, Julián Urbano, and Anja Volk, editors, Proceedings of the 20th International Society for Music Information Retrieval Conference, ISMIR 2019, Delft, The Netherlands, November 4-8, 2019, pages 685–692, 2019.

[67] Shulei Ji, Jing Luo, and Xinyu Yang. A comprehensive survey on deep music generation: Multi-level representations, algorithms, evaluations, and future directions. CoRR, abs/2011.06801, 2020.

[68] Cheng-i Wang and Shlomo Dubnov. Guided music synthesis with variable markov oracle. In Philippe Pasquier, Arne Eigenfeldt, and Oliver Bown, editors, Musical Metacreation, Papers from the 2014 AIIDE Workshop, October 4, 2014, Raleigh, NC, USA, volume WS-14-18 of AAAI Workshops. AAAI Press, 2014.

[69] Omar Peracha. Improving polyphonic music models with feature-rich encoding. CoRR, abs/1911.11775, 2019.

[70] Olof Mogren. C-RNN-GAN: continuous recurrent neural networks with adversarial training. CoRR, abs/1611.09904, 2016.

[71] Gino Brunner, Yuyi Wang, Roger Wattenhofer, and Sumu Zhao. Symbolic music genre transfer with cyclegan. In Lefteri H. Tsoukalas, Éric Grégoire, and Miltiadis Alamaniotis, editors, IEEE 30th International Conference on Tools with Artificial Intelligence, ICTAI 2018, 5-7 November 2018, Volos, Greece, pages 786–793. IEEE, 2018.

[72] Robert J Sternberg and James C Kaufman. The nature of human creativity. Cambridge University Press, 2018.

[73] Monica Dinculescu, Jesse Engel, and Adam Roberts, editors. MidiMe: Personalizing a MusicVAE model with user data, 2019.
