Visualization in Audio-Based Music Information Retrieval

Computer Music Journal, 30:2, pp. 42–62, Summer 2006. © 2006 Massachusetts Institute of Technology.


Matthew Cooper,* Jonathan Foote,* Elias Pampalk,† and George Tzanetakis††

*FX Palo Alto Laboratory, 3400 Hillview Avenue, Building 4, Palo Alto, California 94304 USA, {foote, cooper}@fxpal.com
†Austrian Research Institute for Artificial Intelligence (OFAI), Freyung 6/6, A-1010 Vienna, Austria, [email protected]
††Department of Computer Science (also Music), University of Victoria, PO Box 3055, STN CSC, Victoria, British Columbia V8W 3P6 Canada, [email protected]


Music information retrieval (MIR) is an emerging research area that explores how music stored digitally can be effectively organized, searched, retrieved, and browsed. The explosive growth of online music distribution, portable music players, and the falling cost of recording indicate that in the near future, most of the recorded music in human history will be available digitally. MIR is steadily growing as a research area, as evidenced by the International Conference on Music Information Retrieval (ISMIR) series (soon in its sixth year) and the increasing number of MIR-related publications in Computer Music Journal and other journals and conference proceedings.

Designing and developing visualization tools for effectively interacting with large music collections is the main topic of this overview article. Connecting visual information with music and sound has long fascinated composers, artists, and painters. Rapid advances in computer performance have enabled a variety of creative endeavors to connect image and sound, ranging from the simple direct renderings of spectrograms popular in software music players to elaborate real-time interactive systems with three-dimensional graphics. Most existing tools and interfaces that use visual representations of audio and music, such as audio editors, treat audio as a monolithic block of digital samples without any information regarding its content. The systems described in this overview are distinguished by their attempt to visually represent higher-level information about the content of music. MIR is a new field, and visualization for MIR is still in its infancy; we therefore believe that this article provides a comprehensive overview of the current state of the art in this area and will inspire other researchers to contribute new ideas.

Background

There has been considerable interest in making music visible. Many artists have attempted to realize the images elicited by sound (Walt Disney's Fantasia being an early, well-known example). Another approach is to quantitatively render the time or frequency content of the audio signal, using methods such as the oscillograph and sound spectrograph (Koenig, Dunn, and Lacey 1946; Potter, Kopp, and Green 1947). These are intended primarily for scientific or quantitative analysis, although artists like Mary Ellen Bute have used quantitative methods such as the cathode ray oscilloscope toward artistic ends (Moritz 1996). Other visualizations are derived from note-based or score representations of music, typically MIDI note events (Malinowski 1988; Smith and Williams 1997; Sapp 2001).

The idea of representing sound as a visual object in a two- or three-dimensional space with properties related to the audio content originated in psychoacoustics. By analyzing data collected from user studies, it is possible to construct perceptual spaces that visually show similarity relations between single notes of different musical instruments (Grey 1975). Using such a timbre space as a control in computer music and performance was explored by Wessel (1979). This idea has been used in the Intuitive Sound Editing Environment (ISEE), in which nested two- and three-dimensional visual spaces are used to browse instrument sounds as experienced by musicians using MIDI synthesizers and samplers (Vertegaal and Bonis 1994). The Sonic Browser is a tool for accessing sounds or collections of sounds using sound spatialization and context-overview visualization techniques in which each sound is represented as a visual object (Fernström and Brazil 2001). Another approach is to visualize the low-level perceptual processing of the human auditory system (Slaney 1997). An interesting visualization that combines traditional audio-editing waveform representations and pitch-based placement of notes is used in the Melodyne software by Celemony (available online at www.celemony.com/cms/).

The main goal of this article is to provide an overview of visualization techniques developed in the context of music information retrieval for representing polyphonic audio signals. A defining characteristic that differentiates the techniques described in this article from most previous work is their use of sophisticated analysis algorithms to automatically extract content information from music stored in digital audio format. The extracted information is then rendered visually. Visualization techniques have been used in many scientific domains (e.g., Spence 2001; Fayyad, Grinstein, and Wierse 2002); they take advantage of the strong pattern-recognition abilities of the human visual system to reveal similarities, patterns, and correlations in space and time. Visualization is especially suited to areas that are exploratory in nature and involve large amounts of data to be analyzed; MIR is a good example of such an area. The concept of browsing is also central to the design of interfaces for MIR. Browsing is defined as "an exploratory, information seeking strategy that depends upon serendipity . . . especially appropriate for ill-defined problems and exploring new task domains" (Marchionini 1995).

Techniques for visualizing music in the context of MIR can be roughly divided into two major categories: techniques for visualizing a single file or piece of music, and techniques for visualizing collections of pieces. The systems described in this article are representative of the possibilities afforded by MIR visualization.

Parameterizing the Audio

The first step in any visualization of an audio signal is to convert the audio into a window-based parametric representation. Typical parameterizations include Mel-Frequency Cepstral Coefficients (MFCCs), spectral features from the Short-Time Fourier Transform (STFT), or subspace representations from principal component analysis. The window size may be varied, although robust analysis typically requires resolution on the order of 20 Hz, that is, 20 windows per second. For most visualization techniques, the actual parameterization is not crucial as long as "similar" sounds yield similar parameters. Psychoacoustically motivated parameterizations have also been explored. Although this article occasionally describes specific parameterizations, it does so only for completeness and clarity. All the described visualization techniques can use other parameterizations.
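As a concrete illustration, the sketch below computes such a window-based MFCC parameterization at roughly 20 analysis windows per second. It assumes the librosa library; the sampling rate, hop size, and number of coefficients are illustrative choices rather than settings prescribed in this article.

```python
import librosa

def mfcc_features(path, n_mfcc=13, frames_per_sec=20):
    """Window-based MFCC parameterization of an audio file (illustrative sketch)."""
    y, sr = librosa.load(path, sr=22050)   # decode and resample to 22.05 kHz
    hop = sr // frames_per_sec             # hop length giving ~20 windows per second
    # Each column of M holds the MFCCs of one analysis window.
    M = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc, hop_length=hop)
    return M.T                             # shape: (num_windows, n_mfcc)
```

A feature matrix of this form, one window per row, is all that the later sketches assume.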

In many of the visualization techniques described herein, it is necessary to reduce the dimensionality of the parameterization of the audio signal so that the information can be mapped, for example, to spatial dimensions or color. Principal components analysis (PCA), or more generally the Karhunen-Loeve transform (Jolliffe 1986), can be used for this purpose. PCA is a dimensionality-reduction technique in which a high-dimensional set of feature vectors is transformed to a set of feature vectors of lower dimensionality with minimum loss of information. The extraction of a principal component amounts to a variance-maximizing rotation of the original feature space. In other words, the first principal component is the axis passing through the centroid of the feature vectors that has the maximum variance and therefore explains a large part of the underlying structure of the feature space. The next principal component tries to maximize the variance not explained by the first. In this manner, consecutive orthogonal components are extracted.

Figure 1 (left) shows a scatter plot of two-dimensional data points. The data are concentrated in an ellipse that is indicated by a dotted line. PCA rotates the original data axes to maximize the variance accounted for by each dimension in the resulting subspace. In practice, PCA is computed using the singular-value decomposition (SVD; Strang 1988) of the rectangular data matrix. As shown in Figure 1 (right), projection of the data onto the rotated axis U1 accounts for more variance in the data than projection onto the original axis X1. Here, U1 is the direction that accounts for the most variance in the original data and is the singular vector corresponding to the largest singular value. The axis U2 is the axis orthogonal to U1 that accounts for as much of the remaining variance in the original data as possible. The subsequent axes in the low-dimensional subspace are calculated similarly.

More specifically, the principal components are linear combinations of the original feature vectors V_i, which can be arranged as columns of a matrix V. To compute the principal components, we first calculate the feature vector covariance matrix C:

C(k, l) = \frac{1}{I} \sum_i \left( V_i(k) - \bar{V}(k) \right) \left( V_i(l) - \bar{V}(l) \right)    (1)

where \bar{V} = \frac{1}{I} \sum_i V_i. Its singular-value decomposition (SVD) is given by

C = U S W^T    (2)

Here, U and W are orthogonal matrices, and S is a diagonal matrix. The principal components of V are the columns of U, and the corresponding singular values are contained in S.

To perform dimensionality reduction from n dimensions to m dimensions, where m < n, the principal components corresponding to the m largest singular values are chosen. The collection of pieces of music over which the covariance matrix C is calculated is important and provides context-sensitivity for PCA-based visualizations. For example, if the feature vectors from only the specific song to be analyzed are used for the computation of the covariance matrix, the resulting PCA will reflect only the variance of that particular file. On the other hand, if a larger collection of pieces of music is used for the computation of the covariance matrix, the resulting PCA will reflect the variances over the entire collection. Therefore, the same piece of music can have different PCA-based visualizations depending on which feature vectors are used to calculate the covariance matrix.
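A minimal sketch of this computation using numpy (the function and variable names are ours): center the feature vectors, form the covariance matrix of Equation 1, decompose it as in Equation 2, and keep the m leading components.

```python
import numpy as np

def pca_project(V, m):
    """Project feature vectors (rows of V) onto their first m principal components."""
    X = V - V.mean(axis=0)          # remove the mean feature vector
    C = np.cov(X, rowvar=False)     # covariance matrix (Equation 1, up to normalization)
    U, S, Wt = np.linalg.svd(C)     # C = U S W^T (Equation 2)
    return X @ U[:, :m]             # coordinates along the m leading components
```

To obtain the collection-dependent behavior discussed above, the covariance and mean would be computed over the whole collection's feature vectors rather than over the single file being visualized.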

Figure 1. The left depicts two-dimensional data concentrated in an ellipse. The right shows the data rotated according to its two principal axes, computed using principal components analysis.

Visualizing a Single Musical Piece

Music is a complex human artifact, and there are many possible ways one could try to describe it. The techniques described in this section attempt to visually represent aspects of a piece of music, such as structure, rhythm, self-similarity, and similarity to other pieces and styles.

Similarity Matrix

Musical pieces generally exhibit some degree of coherence or similarity over their full durations. With the exception of more avant-garde compositions, structure and repetition are general features of nearly all music. For example, the coda often resembles the introduction, and the second chorus generally sounds like the first. On a shorter time scale, successive bars are often repetitive, especially in popular music. The similarity matrix is a general method for visualizing musical structure via its acoustic self-similarity across time, rather than by absolute acoustic characteristics.

Similarity-matrix analysis is a technique for studying the global structure of time-ordered media streams (Foote and Cooper 2003). An audio file is visualized as a square, as shown in Figure 2. Time runs from left to right as well as from top to bottom. In the square matrix, the brightness of point (i, j) is proportional to the audio similarity between instants i and j in the source audio file. Similar regions are bright; dissimilar regions are dark. Thus, there is always a bright diagonal line running from top left to bottom right, because each audio instant is maximally similar to itself. Repetitive similarities, such as repeating notes or motifs, show up as checkerboard patterns: a note that occurs twice will give four bright areas at the corners of a square. The two regions at the off-diagonal corners are the "cross-terms" resulting from the first note's similarity to the second. Repeated themes are visible as diagonal lines parallel to, and separated from, the main diagonal by the time difference between repetitions.

The similarity matrix contains the quantitative similarity between all pairwise combinations of audio windows, and different audio parameterizations can be used depending on the application. Represent the B-dimensional feature vectors computed for N windows of a digital audio file by the vectors {v_1, . . . , v_N} ⊂ R^B. Given a similarity measure d: R^B × R^B → R, the resulting similarity data is embedded in a matrix S as illustrated in Figure 2. The elements of S are

S(i, j) = d(v_i, v_j)    (3)

Throughout, S(i, j) denotes the element of the ith row and jth column of the matrix S.

A common similarity measure is the cosine distance. Given feature vectors v_i and v_j (representing windows i and j, respectively), then

d_{cos}(v_i, v_j) = \frac{\langle v_i, v_j \rangle}{\|v_i\| \, \|v_j\|}    (4)

This measure is large if the vectors are similarly oriented in the feature space. Normalizing the inner product removes the dependence on magnitude (and hence energy, given spectral features). In practice, we typically zero-mean the data to give the highest range to the distance measure. To build a non-negative matrix, we also employ the following exponential similarity measure:

d_{exp}(v_i, v_j) = \exp\left( \frac{d_{cos}(v_i, v_j) - 1}{d_{cos}(v_i, v_j) + 1} \right)    (5)

To visualize an audio file, an image is constructed so that each pixel at location (i, j) is given a gray value proportional to the similarity measure described above.

Figure 2. Diagram of the similarity matrix embedding.
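The following sketch builds such a matrix from a feature matrix with one window per row, using only numpy. The zero-meaning and cosine measure follow Equation 4, the exponential mapping mirrors the form of Equation 5 above, and the small epsilon guards and function names are our own.

```python
import numpy as np

def similarity_matrix(V, exponential=True):
    """Pairwise similarity of feature vectors (rows of V), per Equations 3-5."""
    X = V - V.mean(axis=0)                            # zero-mean the data
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    Xn = X / np.maximum(norms, 1e-12)                 # unit-length rows
    D = Xn @ Xn.T                                     # cosine similarity (Equation 4)
    if exponential:
        D = np.exp((D - 1.0) / (D + 1.0 + 1e-12))     # non-negative mapping (Equation 5)
    return D                                          # S(i, j) of Equation 3
```

Rendering the returned matrix as a grayscale image gives plots like those in Figures 3 and 4.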

We now review a visualization of a popular song. The piece analyzed is Magical Mystery Tour by The Beatles. The 22-kHz digital audio file is divided into non-overlapping 1,024-sample windows at 20 Hz. We calculate the 1,024-point magnitude spectrum and 45 MFCCs from each audio frame. The upper-left panel of Figure 3 shows the similarity matrix computed from the MFCC features using Equation 5. The upper-right panel shows the similarity matrix computed from the seven MFCCs with the largest variances. The seven coefficients are normalized to unit variance before calculating the similarity matrix using the similarity measure of Equation 5. The upper-left panel of Figure 4 shows the matrix computed from the full spectrogram data. The upper-right panel shows the matrix computed using the spectrogram data projected into a subspace composed of its first seven principal components and scaled to unit variance. The bottom panels show visualizations of the manual segmentation of Table 1. The y-axis in the visualizations indicates the cluster of each segment; the x-axis shows time.

Figure 3. Similarity matrices computed from 45 MFCCs (top left) and the seven MFCCs with greatest variance (top right) after low-pass filtering for The Magical Mystery Tour. The bottom row shows visualizations of the manual segmentation for reference.

In both Figures 3 and 4, the most visible structure is the song's coda, from 141 sec to 171 sec, which is distinct from the song's verse, chorus, and bridge elements. Its dissimilarity from the rest of the piece is quantified by the dark regions of low cross-similarity in the bottom-most rows and rightmost columns of the similarity matrices. As expected, the reduced-dimension features show improved discrimination in the corresponding similarity matrices. Overall, the PCA-projected spectrogram features provide the best visualization of the piece's structure.

Figure 4. Similarity matrices computed from the full spectrogram data (top left) and the SVD-projected spectrograms (top right) after low-pass filtering for The Magical Mystery Tour. The bottom row shows visualizations of the manual segmentation for reference.

Table 1. Manual Segmentation of The Magical Mystery Tour

Segment                   Boundaries (sec)
Intro                     0–10
Verse (Voc.)              11–21
Verse (Voc. and Inst.)    22–32
Chorus                    33–42
Verse (Voc.)              43–53
Verse (Voc. and Inst.)    54–64
Chorus                    65–73
Bridge                    74–88
Verse (Voc.)              89–102
Verse (Voc. and Inst.)    103–117
Chorus                    118–141
"Outro"                   141–171

Beat Spectrum and Beat Spectrogram

Both the periodicity and relative strength of rhythmic structure can be derived from the similarity matrix. The term beat spectrum is used to describe a measure of self-similarity as a function of the lag (Foote and Uchihashi 2001). Peaks in the beat spectrum at a particular lag l correspond to audio repetitions at that temporal rate. The beat spectrum B(l) can be computed from the similarity matrix using diagonal sums or autocorrelation methods. A simple estimate of the beat spectrum can be found by diagonally adding the similarity matrix S as follows:

B(l) \approx \sum_{k \in R} S(k, k + l)    (6)

Here, B(0) is simply the sum along the main diagonal over some continuous range R, B(1) is the sum along the first superdiagonal, and so on. A more robust estimate of the beat spectrum is the autocorrelation of S:

B(k, l) \approx \sum_{i, j} S(i, j) S(i + k, j + l)    (7)

Because B(k, l) is symmetric, it is only necessary to perform the sum over one variable to yield a one-dimensional result B(l). This approach works surprisingly well for most kinds of musical genres, tempos, and rhythmic structures.
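A direct implementation of Equation 6 can be sketched with numpy as follows; here R is taken to be each full diagonal, and the normalization by diagonal length is our own choice rather than part of the published method.

```python
import numpy as np

def beat_spectrum(S, max_lag=None):
    """Beat spectrum via diagonal sums of the similarity matrix S (Equation 6)."""
    N = S.shape[0]
    max_lag = max_lag or N // 2
    # B[l] is the mean of the l-th superdiagonal; normalizing by its length
    # keeps long and short diagonals comparable.
    return np.array([np.mean(np.diagonal(S, offset=l)) for l in range(max_lag)])
```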

Figure 5 shows the beat spectrum computed from the first ten seconds of Paul Desmond's jazz composition Take Five, performed by the Dave Brubeck Quartet. Besides being in an uncommon time signature (5/4), this rhythmically sophisticated work requires some interpretation. First, note that there is no obvious periodicity at the actual beat tempo (denoted by solid vertical lines in the figure). Rather, there is a marked periodicity at five beats and a corresponding subharmonic at ten. Jazz aficionados know that "swing" is the subdivision of beats into non-equal periods rather than "straight" (equal) eighth notes. The beat spectrum clearly shows that each beat is subdivided into near-perfect triplets. This is indicated with dotted lines spaced one-third of a beat apart between the second and third beats. A clearer visualization of "swing" would be difficult to achieve by other means.

The beat spectrum can be analyzed to determine tempo and more subtle rhythmic characteristics. Peaks in the beat spectrum give the fundamental rhythmic periodicity (Foote and Uchihashi 2001). Strong off-beats and syncopations can then be deduced from secondary peaks in the beat spectrum. Because the only necessary signal attribute is repetition, this approach is more robust than approaches that look for absolute acoustic features such as energy peaks.

There is an inverse relationship between the time accuracy and the beat spectral precision. Technically, the beat spectrum is a frequency operator and hence does not commute with a time operator. Thus, beat spectral analysis, like frequency analysis, exhibits a tradeoff between spectral and temporal resolution.

Figure 5. Beat spectrum of the jazz composition Take Five.

The beat spectrogram is used to analyze rhythmic variation over time. Like its namesake, it visualizes the beat spectrum over successive windows to show how the rhythm changes across a piece. Time is on the x-axis, with lag time on the y-axis. Each pixel is colored with the scaled value of the beat spectrum at that time and lag, so that peaks are visible as bright horizontal bars at the repetition time. Figure 6 shows the beat spectrogram of a 33-second excerpt of the Pink Floyd song Money. Listeners familiar with this classic-rock chestnut may know the song is primarily in the 7/4 time signature, save for the bridge (middle section), which is in 4/4. The excerpt shown begins at 4 min 55 sec into the song, and it clearly shows the transition from the 4/4 bridge back into the last 7/4 verse. To the left are strong beat spectral peaks on each beat, particularly at two and four beats (the length of a 4/4 bar), along with an eight-beat subharmonic. Two beats occur in slightly less than a second, corresponding to a tempo slightly faster than 120 beats per minute (120 BPM). This is followed by a short two-bar transition. Then, around 10 sec (on the x-axis), the time signature changes to 7/4, clearly visible as a strong seven-beat peak with the absence of a four-beat component. The tempo also slows slightly, visible as a slight lengthening of the time between peaks.
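Building on the beat_spectrum sketch above (again an illustrative sketch rather than the published implementation, with window and hop sizes chosen arbitrarily), a beat spectrogram can be formed by computing the beat spectrum over a sliding block of the similarity matrix and stacking the results as image columns:

```python
import numpy as np

def beat_spectrogram(S, win=400, hop=20, max_lag=200):
    """Stack beat spectra of successive sub-blocks of S; columns are time, rows are lag."""
    cols = []
    for start in range(0, S.shape[0] - win, hop):
        block = S[start:start + win, start:start + win]   # windowed region of the matrix
        cols.append(beat_spectrum(block, max_lag=max_lag))
    return np.array(cols).T                               # shape: (max_lag, num_windows)
```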

Beat Histograms

The beat histogram (BH) is similar to the beat spectrum in that it visualizes the distribution of various beat-level periodicities of the input signal. However, the method of calculation is different. The BH is calculated using periodicity detection in multiple octave channels that are computed using a discrete wavelet transform (DWT). Figure 7 shows a schematic diagram of the calculation. The signal is first decomposed into a number of octave frequency bands using the DWT. Following this decomposition, the time-domain amplitude envelope of each band is extracted separately. This is achieved by applying full-wave rectification, low-pass filtering, and downsampling to each octave frequency band. After removal of the mean, the envelopes of each band are added together, and the autocorrelation of the resulting sum envelope is computed. The dominant peaks of the autocorrelation function correspond to the various periodicities of the signal's envelope. These peaks are accumulated over the whole sound file into a beat histogram, where each bin corresponds to a peak lag, expressed as a beat period in BPM.

Rather than adding one, the amplitude of each peak is added to the beat histogram. That way, when the signal is very similar to itself (strong beat), the histogram peaks will be higher. In Tzanetakis and Cook (2002), six numerical features that attempt to summarize the BH are computed and used for classification. Figure 8 shows a BH for a piece of rock music (Come Together by The Beatles); notice the peaks at 80 BPM (the main tempo) and 160 BPM. The x-axis corresponds to beats per minute, and the y-axis corresponds to the degree of self-similarity for that particular periodicity, or beat strength (Tzanetakis, Essl, and Cook 2002). Many other algorithms for tempo and beat detection have been proposed in the literature and could be used as front ends to visualizations similar to the beat histogram and beat spectrum.

Figure 6. Beat spectrogram of Pink Floyd's Money showing the transition from 4/4 time to 7/4 time around 10 seconds (on the x-axis).

Figure 7. Flow diagram of beat histogram calculation.
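A rough sketch of the calculation in Figure 7 follows, assuming the pywt and scipy packages; the wavelet, band count, smoothing, envelope rate, and peak-picking details are our assumptions, not the published implementation, and the quadratic-time autocorrelation is only practical for short excerpts.

```python
import numpy as np
import pywt
from scipy.signal import find_peaks

def beat_histogram(y, sr, levels=4, env_sr=250, bpm_range=(40, 200)):
    """Simplified beat-histogram sketch: DWT bands -> envelopes -> autocorrelation peaks."""
    n_env = int(len(y) / sr * env_sr)                 # common envelope length (~250 Hz)
    t = np.linspace(0.0, 1.0, n_env)
    envelope_sum = np.zeros(n_env)
    for band in pywt.wavedec(y, 'db4', level=levels): # octave-band decomposition
        env = np.abs(band)                            # full-wave rectification
        env = np.convolve(env, np.ones(32) / 32, mode='same')   # crude low-pass smoothing
        env = np.interp(t, np.linspace(0.0, 1.0, len(env)), env)  # resample to a common rate
        envelope_sum += env - env.mean()              # remove mean and accumulate
    ac = np.correlate(envelope_sum, envelope_sum, mode='full')[n_env - 1:]
    lags = np.arange(1, n_env)
    bpm = 60.0 * env_sr / lags                        # lag (envelope samples) -> tempo (BPM)
    hist = np.zeros(bpm_range[1] - bpm_range[0])
    peaks, props = find_peaks(ac[1:], height=0.0)
    for p, h in zip(peaks, props['peak_heights']):
        b = int(round(bpm[p]))
        if bpm_range[0] <= b < bpm_range[1]:
            hist[b - bpm_range[0]] += h               # add the peak amplitude, not just a count
    return hist
```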

Real-Time Audio Classification Display

The GenreGram is a dynamic, real-time audio display for showing automatic genre classification results. More details about this process can be found in Tzanetakis and Cook (2002). The classification is performed using supervised learning, where a statistical model of the feature distribution for each class is built during training with labeled samples. Once the classifier is trained, it can then be used to classify music it has not encountered before. Although it can be used with any audio signal, it was designed for real-time classification of live radio signals. Each genre is represented as a cylinder that moves up and down in real time based on a classification confidence measure ranging from 0.0 to 1.0. Each cylinder is texture-mapped with a representative image for each genre.
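The 3-D display itself is not sketched here, but the per-genre confidences that drive the cylinders can come from any probabilistic classifier. The sketch below uses scikit-learn Gaussian mixtures as a stand-in for the statistical models of Tzanetakis and Cook (2002); the training data, model choice, and normalization are our assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_genre_models(X_train, y_train, n_components=3):
    """Fit one Gaussian mixture per genre label (stand-in for the paper's class models)."""
    return {g: GaussianMixture(n_components=n_components).fit(X_train[y_train == g])
            for g in np.unique(y_train)}

def genre_confidences(models, x):
    """Map one feature vector to per-genre confidences in [0, 1] for driving the display."""
    log_lik = np.array([m.score_samples(x[None, :])[0] for m in models.values()])
    conf = np.exp(log_lik - log_lik.max())            # avoid underflow before normalizing
    return dict(zip(models.keys(), conf / conf.sum()))
```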

In addition to demonstrating real-time automatic musical genre classification, the GenreGram provides valuable feedback to both users and algorithm designers. Different classification decisions and their relative strengths are combined visually, revealing correlations and classification patterns. Because in many cases the boundaries between genres are fuzzy, a display like this is more informative than a single all-or-nothing classification decision. For example, both male speech and hip-hop are activated in the case of a hip-hop song, as shown in Figure 9. Of course, it is possible to use GenreGrams to display other types of audio classifications, such as instruments, sound effects, and birdsongs.

Mapping Time-Varying Timbre to Color

The basic idea behind timbregrams (Tzanetakis and Cook 2000a) is to map audio files to sequences of vertical color stripes in which each stripe corresponds to a short slice of sound (typically 20 msec to 0.5 sec). Time is mapped from left to right. The similarity of different files (context) is shown as overall color similarity, while the similarity within a file (content) is shown by color similarity within the timbregram. For example, a file that has an ABA structure, where section A and section B have different sound textures, will also have an ABA structure in color. Although it is possible to manually create timbregrams, they are typically created using PCA over automatically extracted feature vectors. Unlike approaches that directly map frequency content to color, such as Comparisonics (www.comparisonics.com), timbregrams allow any parametric audio representation to be used as a front end.

Figure 8. Beat histogram example for a piece of rock music (30-sec clip from Come Together by The Beatles).

Figure 9. GenreGram, a dynamic real-time visualization of music classification.

Two main approaches are used for mapping the principal components to color to create timbregrams. If an indexed image is desired, then the first principal component is divided equally, and each interval is mapped to an index of a colormap. Any standard visualization colormap, such as grayscale or thermometer, can be used. This approach works especially well if the first principal component explains a large percentage of the variance of the data set. In the second approach, the first three principal components are normalized so that they have equal means and variances. Although this normalization distorts the original feature space, in practice it provides more satisfactory results, as the colors are more clearly separated. Each of the first three principal components is mapped to a coordinate in an RGB or HSV color space. In this approach, a full-color timbregram is created.
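A sketch of the second (full-color) approach is given below. It is self-contained numpy code under our own assumptions: the sigmoid squashing into [0, 1] and the optional context argument are our choices, not details taken from the original system.

```python
import numpy as np

def timbregram_rgb(V, context=None):
    """Map each analysis window (a row of V) to an RGB stripe via its first three PCs.

    context: feature vectors used to compute the PCA basis (defaults to V itself),
    which is what makes the coloring collection-dependent.
    """
    C = V if context is None else context
    mean = C.mean(axis=0)
    U, S, Wt = np.linalg.svd(np.cov(C - mean, rowvar=False))
    pcs = (V - mean) @ U[:, :3]                       # first three principal components
    # Normalize each component to equal mean/variance, then squash into (0, 1) for RGB.
    pcs = (pcs - pcs.mean(axis=0)) / (pcs.std(axis=0) + 1e-12)
    rgb = 1.0 / (1.0 + np.exp(-pcs))                  # sigmoid keeps values in (0, 1)
    return rgb                                        # shape: (num_windows, 3), one stripe per window
```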

There is a tradeoff between the ability to show small-scale local structure and global overall similarity, depending on the quantization levels and the amount of variance in color range allowed. For example, by allowing many quantization levels and a large color-range variance, different sections of the same audio file can be visually distinguished. If only an overall color is desired, fewer quantization levels and smaller variation should be used.

It is important to note that the similarity in color depends not only on the particular file (content) but also on the collection over which the PCA is calculated (context). That means that two files might have timbregrams with similar colors as part of collection A and timbregrams with different colors as part of collection B. For example, a string quartet and an orchestral piece will have different timbregrams if viewed as part of a classical music collection but similar timbregrams if viewed as part of a collection that contains files from many different musical genres.

Timbregrams can be arranged in two-dimensional tables for browsing. The table axes can be either computed automatically or created manually. For example, one axis might correspond to the year of release, and the other might correspond to the automatically extracted tempo of the song. In addition, timbregrams can be superimposed over traditional waveform displays (see Figure 10) and texture-mapped over objects in a timbre space, described later.

Figure 11 shows the timbregrams of six sound files. The three files in the left column contain speech, and the three on the right contain classical music. It is easy to visually separate music and speech even in the grayscale image. It should be noted that no explicit class model of music and speech is used; the different colors are a direct result of the visualization technique. The bottom-right sound file (opera) is light purple, and the speech segments are light green. In this mapping, light and bright colors correspond to speech or singing (Figure 11, left). Purple and blue colors typically correspond to classical music (Figure 11, right).

Timbregrams of pieces of orchestral music are shown in Figure 12. From the figure, it is clear that the fourth piece from the top has an AB structure, where the A part is similar to the second piece from the top and the B part is similar to the last piece from the top. The A part is light pink (in color) or light gray (in grayscale), and the B part is dark purple (in color) or dark gray (in grayscale). This is confirmed by listening to the corresponding pieces, in which A is a loud, energetic movement where the entire orchestra is playing and B is a lightly orchestrated flute solo.

Visualizing Music Collections

Managing the increasing size of digital music and sound collections is challenging. Traditional tools such as the file browser provide little information to assist this process. In this section, some approaches to visualizing large music collections for browsing and retrieval are described.

Timbre Spaces

The Timbre Space Browser (Tzanetakis and Cook 2000b) maps each audio file to an object in a two- or three-dimensional virtual space. The main properties that can be mapped are the x, y, and z coordinates for each object. In addition, a shape, texture image or color, and text annotation can be provided for each object. Standard graphical operations such as display zooming, panning, and rotating can be used to explore the browsing space. Data model operations such as section pointing and semantic zooming are also supported. Selection specification can also be performed by specifying constraints on the browser and object properties. For example, the user can ask to select all the files that have positive x values, triangular shapes, and red color. Principal curves, originally proposed in Hastie and Stuetzle (1989) and used for sonification in Herman, Meinicke, and Ritter (2000), can be used to move sequentially through the objects.

Figure 13 shows a two-dimensional timbre space of sound effects. The icons represent different types of sound effects, such as walking (dark squares) and various other types (white squares) such as tools, telephones, and doorbell sounds. Although the icons have been assigned manually, the x and y coordinates of each icon are calculated automatically based on audio features. This way, files that are similar in content are visually clustered together, as can be seen in the figure, where the dark walking sounds occupy the left side and the white sounds occupy the right side.

Figure 10. Timbregram superimposed over a waveform.

Figure 14 shows a 3-D timbre space of different pieces of orchestral music. Each piece is represented as a colored rectangle. The x, y, and z coordinates are automatically extracted based on music similarity, and the rectangle coloring is based on timbregrams. These figures contain fewer objects than typical configurations for clarity of presentation on paper. Audio collections with sizes that typically range from 100 to 1,000 files/objects can easily be accommodated with timbre spaces.
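One way such coordinates might be derived is sketched below; this is our own construction rather than the published one, summarizing each file by the mean of its feature vectors and projecting the collection to three dimensions with PCA.

```python
import numpy as np

def timbre_space_coords(file_features):
    """Map each file to (x, y, z) coordinates via a collection-wide PCA.

    file_features: list of arrays, one per file, each of shape (num_windows, num_dims).
    """
    summaries = np.array([F.mean(axis=0) for F in file_features])  # one vector per file
    X = summaries - summaries.mean(axis=0)
    U, S, Wt = np.linalg.svd(np.cov(X, rowvar=False))
    return X @ U[:, :3]                               # rows are (x, y, z) positions of the files
```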

Music Similarity via Self-Organizing Maps and Smoothed Data Histograms

Islands of Music is a graphical interface to music collections (Pampalk 2001; Pampalk, Rauber, and Merkl 2002a). Similar pieces of music are automatically clustered into groups and visualized as islands. On an island, mountains and hills represent subgroups of similar pieces. Land bridges connect related islands. The music is arranged such that similar pieces and groups are close to each other on the map.

Islands of Music is based on the self-organizing map (SOM) neural network algorithm (Kohonen 2001) combined with the smoothed data histogram (SDH) visualization (Pampalk, Rauber, and Merkl 2002b). The SOM consists of units arranged on a fixed grid. Each unit represents a prototypical piece of music. The pieces in the collection are mapped to the prototype that is most similar (also known as the best-matching unit). During the training process, the units are adapted to better represent the pieces mapped to them, with the constraint that units close to each other on the grid represent similar music.

Figure 11. Timbregrams of speech (left) and music (right). (The left column is light green, and the right column is dark purple.)

Figure 12. Timbregrams of orchestral music pieces.

Figure 13. Two-dimensional timbre space of sound effects.

Figure 14. Three-dimensional timbre space of orchestral music pieces.

The SDH visualization shows the distribution of the music collection on the map. It is a rough and robust estimate of the corresponding probability density function. In particular, for each piece, the weighted contributions of the n closest units are accumulated in the histogram bins. One of the results is that the land bridges between related islands become more apparent. The SDH informs the user in which areas there is a high density of pieces ("islands") and which areas are populated sparsely (the "sea"). The color mapping used has the following range: dark blue (deep sea), light blue (shallow sea), yellow (beach), light green (grass), dark green (forest), gray (rocks), and white (snow).
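A compact sketch of this pipeline is shown below, assuming the MiniSom package; the grid size, training length, smoothing width n, and the particular vote weighting are illustrative choices, not the published parameters.

```python
import numpy as np
from minisom import MiniSom

def islands_density(features, grid=(10, 10), n_closest=3, iters=5000):
    """Train a SOM on per-piece feature vectors and accumulate an SDH-style density map."""
    som = MiniSom(grid[0], grid[1], features.shape[1], sigma=1.0, learning_rate=0.5)
    som.train_random(features, iters)                   # adapt the units to the collection
    W = som.get_weights()                               # codebook, shape (grid_x, grid_y, dims)
    units = [(x, y) for x in range(grid[0]) for y in range(grid[1])]
    density = np.zeros(grid)
    for v in features:
        d = np.array([np.linalg.norm(v - W[x, y]) for x, y in units])
        # spread weighted votes over the n closest units (the smoothed-data-histogram idea)
        for rank, idx in enumerate(np.argsort(d)[:n_closest]):
            density[units[idx]] += (n_closest - rank) / n_closest
    return density                                      # high values form "islands", low values the "sea"
```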

The most critical component of the Islands of Music interface is the similarity measure that is used. Several measures are available (e.g., Pampalk 2004); however, none of these performs comparably to similarity ratings by a human listener.

An example of Islands of Music is shown in Figure 15. (The same map in two dimensions is shown in Figure 16.) The snow-covered mountain on the lower left represents more aggressive music from Papa Roach and Limp Bizkit (a mix of metal, punk, and rap). Following the land bridge to the mountain toward the right leads to less aggressive music, including Living in a Lie by Guano Apes, Not an Addict by K's Choice, and Adia by Sarah McLachlan (all of which are slow songs sung by women that sound quite similar); this area also includes songs such as Yesterday by The Beatles, California Dreaming by The Mamas and The Papas, and House of the Rising Sun by The Animals. The third snow-covered mountain (lower right) represents classical music such as Für Elise by Beethoven. Other pieces on the same island include orchestral pieces and slow love songs.

A frequently asked question is what the x-axis and y-axis represent. If the mapping were linear, the two dimensions would correspond to the first two principal components of the data. However, the main advantage of the SOM compared to PCA is its ability to map the data nonlinearly. Thus, it is not possible to directly label the axes. Nevertheless, different options to explain the regions of the map are available, such as "weather charts" (Pampalk, Rauber, and Merkl 2002a). The idea is to visualize a third dimension (temperature, air pressure, strength of bass beats, etc.) on top of the map.

An example visualizing the distribution of bass beats is shown in Figure 17. Note that instead of the grayscale used here, we usually use a color scale ranging from blue (low values) to red (high values). The classical music island has the weakest bass beats, the two other mountains described previously have about average bass beats, and most of the islands in the upper region have very strong bass beats. For example, very strong bass beats can be found on the upper-left island, which contains mainly music from Bomfunk MC's (a mix of hip-hop, electro-funk, and house).

In the previous example, only 77 pieces were mapped. The same approach can be used for larger collections, as shown in Figure 18. This simply requires a zoom function. Alternatives include using growing hierarchical SOMs to organize music (Rauber, Pampalk, and Merkl 2002). To allow the user to fully benefit from hierarchical organizations, it would be useful to have an automatic summarization of individual music pieces and sets of pieces, for example, a short sequence of music that is typical for a whole island.

Figure 18 shows what an organization for a larger collection might look like. The highlighted island contains mainly music from Bomfunk MC's. The island a bit to the lower left contains mainly Red Hot Chili Peppers, and the mountain on the opposite side of the map (lower-left corner) contains mainly classical pieces.

Figure 19 shows the organization of 3,298 pieces from the magnatune.com collection. On the Web site, the pieces are organized into eleven genres, with most pieces labeled as classical music. Figure 19 shows the distribution of the genres on the SOM. Note that most genres are not isolated and well-defined clusters but rather spread out over the whole map, with significant overlap with most other genres. In particular, world music is significantly spread out. The most compact cluster is punk music, which is located opposite classical music.

Figure 15. Islands of Music.

Figure 16. Flat view of Figure 15 with song labels added.

Combining Different Views

So far, the assumption for the Islands of Music visualization was that there is one overall similarity measure according to which the pieces are organized. However, music similarity has many dimensions, such as tempo, rhythm, instrumentation, lyrics, cultural context, harmony, melody, and so on. Each of these dimensions defines a specific view of the music collection. These views can be combined using aligned-SOMs (Pampalk, Dixon, and Widmer 2004) to allow the user to smoothly and gradually change focus between the views.

Figure 17. Weather charts, revealing areas with strong bass beats.

Aligned-SOMs comprise many individual SOMs of the same size stacked on top of each other. Each SOM represents a specific view, for example, one derived from mixing 20 percent timbre similarity and 80 percent rhythmic similarity. Neighboring SOMs represent similar views (i.e., similar mixing weights). During training of the SOMs, an additional constraint is enforced to ensure that each piece of music is located in the same area on neighboring SOMs.

Figure 20 shows an aligned-SOM combining a view based on rhythmic properties and a view based on MFCCs (describing spectral characteristics related to timbre). In the upper part of the screenshot is the Islands of Music view. For example, in the current view, classical piano pieces are located in the lower left, and pieces from Papa Roach are in the upper right. Below are the codebooks, which are only of interest to a researcher studying specific characteristics of the similarity measures used. Below these codebooks is a slider that allows the user to focus more on rhythmic similarity (by moving the slider to the left) or timbre-based similarity (by moving the slider to the right). When the slider is moved, the pieces on the map are slowly rearranged to adjust to the new definition of similarity. As the pieces move, the islands also slowly move, sink, or emerge.
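The slider behavior can be understood as interpolating between two distance measures before reorganizing the map. A minimal sketch of that mixing step follows; the formulation, normalization, and the hypothetical rhythm/timbre feature arguments are ours, not the aligned-SOM training procedure itself.

```python
import numpy as np

def mixed_distances(rhythm_feats, timbre_feats, w):
    """Pairwise distance matrix mixing rhythmic and timbral views.

    w = 0 focuses entirely on rhythm, w = 1 entirely on timbre; intermediate
    values correspond to the intermediate SOMs in the aligned-SOM stack.
    """
    def pairwise(X):
        diff = X[:, None, :] - X[None, :, :]
        D = np.linalg.norm(diff, axis=-1)
        return D / (D.max() + 1e-12)                  # normalize so the two views are comparable
    return (1.0 - w) * pairwise(rhythm_feats) + w * pairwise(timbre_feats)
```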

Figure 18. Islands of Music with 359 pieces.

Figure 19. Visualization of genre distributions in the magnatune.com collection. (Genre panels: world, rock, punk, pop, new age, metal, jazz, folk, ambient, electronic, classical.)


Conclusions

A variety of audio visualization techniques have been proposed in the context of MIR for representing music pieces and collections of them visually. These techniques rely on state-of-the-art audio signal processing and machine-learning techniques that automatically extract content information from audio signals, which is subsequently mapped to visual attributes. By visualizing content information, one can take advantage of the strong pattern-recognition abilities of the human visual system to identify structure and patterns in audio signals. For example, the ABA structure of a piece of music can be immediately recognized in a visualization such as a similarity matrix or a timbregram but requires a few minutes of listening to be identified aurally.

Figure 20. Combining different views using aligned-SOMs.

A related field that is also still in its infancy is the visualization of symbolic music data and its connections to common music notation. The explicit visualization of pitch information and its integration with timbre and rhythm visualization is another area of future research. Evaluation is one of the biggest challenges in any type of visualization and typically requires extensive user studies. Because the field of MIR is new, the related visualization techniques are still mostly an academic curiosity, and their evaluation has been mostly informal. It is our hope that these ideas will serve as seeds for highly interactive visual interfaces for exploring large collections of music, and for more large-scale evaluation experiments in the future.

References

Fayyad, U., G. Grinstein, and A. Wierse, eds. 2002. Information Visualization in Data Mining and Knowledge Discovery. Burlington, Massachusetts: Morgan Kaufmann.

Fernström, M., and E. Brazil. 2001. "Sonic Browsing: An Auditory Tool for Multimedia Asset Management." Proceedings of the International Conference on Auditory Display: 132–135.

Foote, J., and M. Cooper. 2003. "Media Segmentation Using Self-Similarity Decomposition." Proceedings of SPIE 5021:167–175.

Foote, J., and S. Uchihashi. 2001. "The Beat Spectrum: A New Approach to Rhythm Analysis." Proceedings of the IEEE International Conference on Multimedia and Expo (ICME). Piscataway, New Jersey: Institute of Electrical and Electronics Engineers.

Grey, J. M. 1975. "An Exploration of Musical Timbre." Ph.D. thesis, Department of Psychology, Stanford University.

Hastie, T., and W. Stuetzle. 1989. "Principal Curves." Journal of the American Statistical Association 84(406):502–516.

Herman, T., P. Meinicke, and H. Ritter. 2000. "Principal Curve Sonification." Proceedings of the 2000 International Conference on Auditory Display. New York: Association for Computing Machinery.

Jolliffe, I. 1986. Principal Component Analysis. New York: Springer.

Koenig, W., H. K. Dunn, and L. Y. Lacey. 1946. "The Sound Spectrograph." Journal of the Acoustical Society of America 18:19–49.

Kohonen, T. 2001. Self-Organizing Maps. Berlin: Springer.

Malinowski, S. 1988. "The Music Animation Machine." Available online at www.well.com/user/smalin/mam.html.

Marchionini, G. 1995. Information Seeking in Electronic Environments. Cambridge: Cambridge University Press.

Moritz, M. 1996. "Mary Ellen Bute: Seeing Sound." Animation World 1(2):29–32.

Pampalk, E. 2001. "Islands of Music." Master's thesis, Vienna University of Technology.

Pampalk, E. 2004. "A MATLAB Toolbox to Compute Similarity from Audio." Proceedings of the 2004 International Conference on Music Information Retrieval. Barcelona: Universitat Pompeu Fabra, pp. 254–257.

Pampalk, E., S. Dixon, and G. Widmer. 2004. "Exploring Music Collections by Browsing Different Views." Computer Music Journal 28(2):49–62.

Pampalk, E., A. Rauber, and D. Merkl. 2002a. "Content-Based Organization and Visualization of Music Archives." Proceedings of ACM Multimedia. New York: Association for Computing Machinery, pp. 570–579.

Pampalk, E., A. Rauber, and D. Merkl. 2002b. "Using Smoothed Data-Histograms for Cluster Visualization in Self-Organizing Maps." Proceedings of the International Conference on Artificial Neural Networks. Piscataway, New Jersey: Institute of Electrical and Electronics Engineers, pp. 871–876.

Potter, P., G. Kopp, and H. Green. 1947. Visible Speech. New York: Van Nostrand.

Rauber, A., E. Pampalk, and D. Merkl. 2002. "Using Psycho-Acoustic Models and Self-Organizing Maps to Create a Hierarchical Structuring of Music by Sound Similarities." Proceedings of the 2002 International Conference on Music Information Retrieval. Paris: IRCAM, pp. 71–80.

Sapp, C. S. 2001. "Harmonic Visualizations of Tonal Music." Proceedings of the 2001 International Computer Music Conference. San Francisco, California: International Computer Music Association, pp. 423–430.

Slaney, M. 1997. "Connecting Correlograms to Neurophysiology and Psychoacoustics." Paper presented at the XI International Symposium on Hearing, Grantham, UK, 1–6 August.

Smith, S. M., and G. Williams. 1997. "A Visualization of Music." Proceedings of Visualization '97. New York: Association for Computing Machinery, pp. 499–502.

Spence, R. 2001. Information Visualization. Boston: Addison-Wesley.


Strang, G. 1988. Linear Algebra and Its Applications. Orlando, Florida: Harcourt Brace Jovanovich.

Tzanetakis, G., and P. Cook. 2000a. "Audio Information Retrieval (AIR) Tools." Proceedings of the 2000 International Conference on Music Information Retrieval. Plymouth, Massachusetts: University of Massachusetts at Amherst.

Tzanetakis, G., and P. Cook. 2000b. "3D Graphics Tools for Sound Collections." Paper presented at the 3rd International Conference on Digital Audio Effects, Verona, Italy, 7–9 December.

Tzanetakis, G., and P. Cook. 2002. "Musical Genre Classification of Audio Signals." IEEE Transactions on Speech and Audio Processing 10(5):293–302.

Tzanetakis, G., G. Essl, and P. Cook. 2002. "Human Perception and Computer Extraction of Beat Strength." Paper presented at the 5th International Conference on Digital Audio Effects, Hamburg, Germany, 26–28 September.

Vertegaal, R., and E. Bonis. 1994. "ISEE: An Intuitive Sound Editing Environment." Computer Music Journal 18(2):21–29.

Wessel, D. 1979. Low-Dimensional Control of Musical Timbre. Paris: Centre Georges Pompidou.
