
COMPARING RNN PARAMETERS FOR MELODIC SIMILARITY

Tian Cheng, Satoru Fukayama, Masataka Goto
National Institute of Advanced Industrial Science and Technology (AIST), Japan

{tian.cheng, s.fukayama, m.goto}@aist.go.jp

ABSTRACT

Melodic similarity is an important task in the Music Information Retrieval (MIR) domain, with promising applications including query by example, music recommendation and visualisation. Most current approaches compute the similarity between two melodic sequences by comparing their local features (distances between pitches, intervals, etc.) or by comparing the sequences after aligning them. In order to find a better feature representing global characteristics of a melody, we propose to represent the melodic sequence of each musical piece by the parameters of a generative Recurrent Neural Network (RNN) trained on its sequence. Because the trained RNN can generate the identical melodic sequence of each piece, we can expect that the RNN parameters contain the temporal information within the melody. In our experiment, we first train an RNN on all melodic sequences, and then use it as an initialisation to train an individual RNN on each melodic sequence. The similarity between two melodies is computed using the distance between their individual RNN parameters. Experimental results show that the proposed RNN-based similarity outperforms the baseline similarity obtained by directly comparing melodic sequences.

1. INTRODUCTION

Melodic similarity is the task of analysing the similarity between melodies; it has been used for music retrieval, recommendation, visualisation and so on. To compute the similarity, a melody is typically represented by a sequence of monophonic musical fragments/events (MIDI events, pitches, etc.). Current approaches usually compare two melodic sequences using the string edit distance [8, 9, 17], geometric measures [19] or N-Gram based measures [5, 27]. Alignment-based methods are applied when two melodic sequences are of different lengths [15, 23], or when the events of two sequences do not correspond to each other one by one [2]. Beyond melodic sequences, melody slopes on continuous melody contours have also been aligned to compare melodic similarity [28]. Readers can refer to [25] for a survey of state-of-the-art melodic similarity methods.

© Tian Cheng, Satoru Fukayama, Masataka Goto. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Tian Cheng, Satoru Fukayama, Masataka Goto. “Comparing RNN parameters for melodic similarity”, 19th International Society for Music Information Retrieval Conference, Paris, France, 2018.

The existing methods focus on local features extracted from melodic sequences, such as distances between pitches or between subsets of a melodic sequence (N-Grams). In addition, alignment is needed when two melodic sequences are not directly comparable.

In order to deal with these drawbacks, we propose to train a generative Recurrent Neural Network (RNN) on a melodic sequence, and to use the RNN parameters to represent that sequence. The proposed feature (the RNN parameters) projects a melodic sequence to a point in the parameter space, and has two characteristics. Firstly, the feature is independent of the length of the input melodic sequence, because every sequence is represented by RNN parameters of the same dimension. Secondly, because the RNN can generate an identical sequence, we can expect that the RNN parameters contain the global, temporal information of the melody.

In our experiment, we first train an RNN on all melodic sequences from 80 popular songs as an initialisation. With this initialisation, RNNs are then trained on the individual melodic sequences. All networks are trained in TensorFlow. We compute the similarity between two melodic sequences as the Cosine similarity of their RNN parameters. The results show that the similarity based on RNN parameters outperforms the baseline similarity obtained by comparing the melodic sequences directly. To the best of our knowledge, this is the first study that uses the parameters of generative RNNs for computing melodic similarity.

2. RELATED WORK

In this section, we introduce related work on RNN-based melody generation models, and briefly review research on word and sentence embedding for understanding semantic meaning in natural language processing.

2.1 RNN-based melody generation models

We discuss several state-of-the-art RNN-based melody generation models. RNN-based generative models are usually built with Long Short-Term Memory (LSTM) units in order to model long-term dependencies; examples include Melody RNN in Magenta [1] and folk-rnn [22]. Magenta [1] uses 2-layer RNNs with 64 or 128 LSTM units per layer, while folk-rnn [22] uses a deeper network (an RNN with 3 hidden layers of 512 LSTM units each).

The RNNs generate a melody by predicting the next melodic event from its previous N events:

$[x_{t-N}, \ldots, x_{t-1}] \rightarrow x_t,$


Models                 Representation                       Architecture
Magenta [1]            MIDI event                           2-layer RNN (LSTM)
Folk-rnn [22]          abc notation                         3-layer RNN (LSTM)
Hierarchical RNN [26]  bar profile, beat profile and note   3 RNNs (2-layer LSTM) for bar, beat and note

Table 1: Brief summary of RNN-based melody generation models.

where $x_t$ denotes the melodic event at time $t$. The melodic event can be represented in many forms, for example MIDI events [1], abc notation [22] and so on, as shown in Table 1. With quantised time steps (in sixteenth notes, for example), a melody can be represented as a sequence of pitches¹ or MIDI events (pitch onset, offset, and no event) [1].

Rhythm information can also be modelled for melody generation. One simple way is to concatenate beat information with the melodic event of each frame and feed both into the network [1]. Several hierarchical RNNs with rhythm information have also been proposed. In [4], each note is represented by its pitch and duration, and 2 RNNs (a rhythm RNN and a melody RNN) are trained for duration and pitch, respectively. The rhythm network receives the current pitch and duration as inputs, and outputs the duration of the next note. The melody network receives the current pitch and the generated upcoming duration as inputs to generate the pitch of the next note. [26] trains 3 RNNs for bar, beat, and note, respectively. The first RNN generates bar profiles; the generated bar profiles are fed into the second network to generate beats, and then the bar and beat profiles are fed into the third network to generate notes.

Studies of generative RNN models typically present generated examples [1, 22] as results, or conduct a listening test for evaluation [26]. We believe that a generative RNN actually learns something ‘musical’ and can be used for music analysis. In this paper we extend the utility of the generative RNN to represent a melody, and evaluate it on a melodic similarity task.

2.2 Word embedding and sentence embedding

In natural language processing, word embedding and sentence embedding aim to represent the semantic meanings of words and sentences. Two successful word embedding models were introduced in [13, 14]: word representations are learnt so as to predict the surrounding words, or to predict the current word from its context. In these ways, the meaning of a word is related to its context. With embedded words, a representative vector for a sentence (a sequence of words) can be learned while parsing the sentence [21], or can be trained in a weakly supervised way on click-through data by making the vectors of sentences with similar meanings close to each other [18]. Inspired by word embedding, [11] learns to represent a paragraph by predicting the words in the paragraph using the previous words and a paragraph vector.

¹ https://brangerbriz.com/blog/using-machine-learning-to-create-new-melodies/

The same paragraph vector is shared when predicting the words in the paragraph, and is then used to represent the paragraph.

We believe that word embedding may correspond to chord embedding [3, 12] in understanding music, and sentence embedding may correspond to representing a sequence of chords (also an interesting topic to investigate). In general, the musical meaning (of a sequence of pitches or chords) is less intuitive than the textual meaning (of a word or a sentence). Thus, it is more difficult to learn a good representation for a musical sequence. In this paper we work on representing a melody (a sequence of pitches): we train an RNN model to predict the current pitch from its previous pitches in a melody, and represent the melody by the RNN parameters. To the best of our knowledge, this is the first work to use network parameters directly as a representation.

3. TRAINING RNNS

For each melodic sequence, we train a generative RNN on it; the parameters of the trained RNN are used as a feature to represent the melody. We first train an initialisation on all melodic sequences, and then train on the individual melodic sequences starting from that initialisation.

3.1 Data

We conduct the experiment on the RWC Music Database (Popular Music) [7]. A subjective similarity study [10] was undertaken on 80 songs (RWC-MDB-P-2001 No. 1-80) of the RWC Music Database. In this study, 27 participants were asked to vote on the similarity (of melody, rhythm, vocals and instruments, respectively) for 200 pairs of clips after listening to them. Each clip lasts for 30 seconds (starting from the first chorus starting time). For these pairs of clips, the similarity votes range from 0 to 27.² The larger the vote, the more similar the clips. The melodic similarity matrix, indicating the similarity scores of the 200 pairs of clips, is shown in Figure 1. The matrix is symmetric, because if a is similar to b, then b is similar to a as well. There are 400 non-zero values in the matrix (twice 200, because of the symmetry).

We use the same 30-second clip as in the subjective study [10] from each song for training the RNNs. We denote the clip from piece ‘RWC-MDB-P-2001 No. X’ as clip X, X ∈ [1, 80]. The melodic similarity results of this study [10] are used as the ground truth for evaluation.
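For reference, the following sketch shows how such a symmetric vote matrix can be assembled. The `vote_data` list is a hypothetical stand-in for the published annotation (pairs of clip numbers with their melody votes), not the actual file format.

```python
import numpy as np

# Hypothetical excerpt of the vote annotation: (clip_a, clip_b, melody votes).
vote_data = [(79, 80, 23), (47, 68, 19), (6, 16, 17)]

S = np.zeros((80, 80), dtype=int)
for a, b, votes in vote_data:
    S[a - 1, b - 1] = votes   # clips are numbered 1..80
    S[b - 1, a - 1] = votes   # symmetry: a similar to b implies b similar to a
```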

3.2 Arranging the training data

We train RNNs using the melody annotations of the RWC Music Database (Popular Music) from the publicly available AIST Annotation [6]. A melody in the annotation is represented as a fundamental frequency sequence in 10 ms frames, as shown in Figure 2(a). We call the frames with frequencies ‘melody frames’, and the frames without frequencies ‘silent frames’. We convert the frequencies of melody frames into pitches, as given by Eq. (1) below.

² The dataset [10] has been made publicly available on the web page of the RWC Music Database at http://staff.aist.go.jp/m.goto/RWC-MDB/AIST-Annotation/SSimRWC/.


Figure 1: The melodic similarity of 200 pairs of clips. [80×80 matrix; both axes: clip number.]

For melody frames, the frequency $f$ is converted into a pitch $p$ (indicated by a MIDI note number):

$p = 69 + 12 \log_2 \frac{f}{440}. \qquad (1)$

The histogram of the pitches in the training set is shown in Figure 3. We focus on pitches in the 3 octaves ranging from 43 to 78; frames with pitches outside this range are treated as silent frames.
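As a concrete illustration, the conversion of Eq. (1) together with the range filter can be sketched as follows. The mapping of the 36 pitches and the silent state onto indices 0-36 is our assumption, since the paper does not specify the encoding order.

```python
import numpy as np

SILENT = 36  # assumed index of the silent state; pitches 43..78 map to 0..35

def frame_to_state(f0):
    """Map one annotated F0 value (Hz) to a network state, following Eq. (1).

    Unvoiced frames and pitches outside the 3-octave range 43-78 are
    treated as silent frames.
    """
    if f0 <= 0:                                     # unvoiced frame
        return SILENT
    p = int(round(69 + 12 * np.log2(f0 / 440.0)))   # Eq. (1)
    if p < 43 or p > 78:                            # outside the 3-octave range
        return SILENT
    return p - 43
```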

3.2.1 Frame hop size

The original frames are arranged with a hop size of 10 ms. We use a hop size of 50 ms instead (shown in Figure 2(b)), because with a small frame hop size RNNs tend to simply repeat the previous frame.

3.2.2 Skip silent frames

Because of the high ratio of silent frames (shown in Figure 2(b)), using all frames in the training data would yield many invalid training samples in which a sequence of silent frames predicts a silent frame. Therefore, we simply skip all the silent frames to discard those invalid training samples, resulting in a pitch sequence with only melody frames (shown in Figure 5(b)).

We aim to look back for 2 seconds to predict the next frame. With a frame hop size of 50 ms, there are 40 frames in the input sequence: $[x_{t-N}, \ldots, x_{t-1}] \rightarrow x_t$, with $N = 40$.

3.2.3 Zero-padding at the beginning

We find that if the first training sample is $[x_0, \ldots, x_{39}] \rightarrow x_{40}$, then the generation of the first 40 frames is not modelled by the RNN. In order to generate the whole sequence, we concatenate a sequence of 40 silent frames at the front of each clip, so that the first training sample is $[x_S, \ldots, x_S] \rightarrow x_0$, where $x_S$ is the silent frame padded at the front of the clip.
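Putting Sections 3.2.1-3.2.3 together, a minimal sketch of the data arrangement could look as follows (function and variable names are ours):

```python
import numpy as np

N = 40  # look back 2 seconds at a 50 ms hop size

def make_training_samples(states_10ms, silent=36):
    """Arrange one clip's state sequence into (input, target) pairs:
    resample the 10 ms annotation grid to a 50 ms hop, skip the silent
    frames, and pad N silent frames at the front of the clip."""
    seq = list(states_10ms)[::5]                       # 10 ms -> 50 ms hop size
    seq = [s for s in seq if s != silent]              # skip silent frames
    seq = [silent] * N + seq                           # zero-padding at the beginning
    xs = [seq[t - N:t] for t in range(N, len(seq))]    # [x_{t-N}, ..., x_{t-1}]
    ys = [seq[t] for t in range(N, len(seq))]          # -> x_t
    return np.array(xs), np.array(ys)
```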

3.3 Network architecture

We apply a network architecture similar to Magenta [1], but with GRU cells instead of LSTM cells to reduce the number of parameters.

(a) A sequence of fundamental frequencies.

(b) A sequence of pitches.

Figure 2: Melodic sequences with different frame hop sizes. Frames with values of 0 are silent frames.

Figure 3: The histogram of the pitches in the dataset.

The RNN contains 2 hidden layers with 64 GRU cells per layer. The output layer is a fully-connected layer with a softmax activation function. The inputs are one-hot encoded vectors with a dimension of 37 (36 pitches and a silent state). We want the RNN to fit each individual pitch sequence as closely as possible; in this case, overfitting is intended rather than a problem, hence no dropout is applied.

The network is trained by minimising the cross-entropy loss using Adam optimisation with a learning rate of 0.001 (the other Adam parameters are left at their default values in TensorFlow).
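The paper states the layer sizes, loss and optimiser but not its implementation; a Keras sketch under those settings might look as follows (using integer targets with one-hot inputs is our assumption):

```python
import tensorflow as tf

N_STATES = 37  # 36 pitches (MIDI 43-78) plus one silent state
CONTEXT = 40   # input frames: 2 s at a 50 ms hop size

def build_model(n_units=64):
    """Section 3.3 network: 2 hidden layers of 64 GRU cells each and a
    fully-connected softmax output over the 37 states."""
    inputs = tf.keras.Input(shape=(CONTEXT, N_STATES))   # one-hot encoded frames
    x = tf.keras.layers.GRU(n_units, return_sequences=True)(inputs)
    x = tf.keras.layers.GRU(n_units)(x)                  # no dropout: overfitting is intended
    outputs = tf.keras.layers.Dense(N_STATES, activation="softmax")(x)
    model = tf.keras.Model(inputs, outputs)
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
        loss="sparse_categorical_crossentropy",  # cross entropy on integer targets
        metrics=["accuracy"],
    )
    return model
```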

3.4 Initialisation and training on individual clips

In order to obtain consistent training, we use a fixed initialisation. The initialisation is trained on the training samples from all 80 clips for 100 epochs. Then, starting from this initialisation, we train an individual RNN on each melodic sequence for 500 iterations.³ After the data arrangement of Section 3.2, there are around 200-600 training samples for every clip.

³ An iteration means that the RNN parameters are updated once on a batch of training samples; in contrast, an epoch means a full pass over all training samples. We use the iteration number to stop training because this way the RNN parameters are updated the same number of times, and hence are more comparable. However, when to stop training still needs further investigation.


                Initialisation   Individual RNNs
No. of RNNs     1 RNN            80 RNNs
Training data   80 clips         each clip
Batch size      512              64
Early stop      100 epochs       500 iterations

Table 2: RNN training settings.

(a) Batch acc. for initialisation. (b) Batch loss for initialisation.

(c) Batch acc. for training on clip 1. (d) Batch loss for training on clip 1.

Figure 4: Batch accuracies and losses of training for initialisation and of training on clip 1 with the initialisation.

We use a large batch size of 512 for the initialisation training because of the large number of training samples, and a smaller batch size of 64 when training on each individual sequence. The training settings are summarised in Table 2.
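A sketch of this two-stage procedure with the Table 2 settings, reusing `build_model` and `make_training_samples` from the sketches above (`x_all, y_all` pool the one-hot samples of all 80 clips; `clips` is a list of per-clip sample pairs):

```python
import tensorflow as tf

def train_all(x_all, y_all, clips):
    """Stage 1: one shared initialisation on all clips (batch 512, 100 epochs).
    Stage 2: one RNN per clip, fine-tuned for 500 iterations (batch 64)."""
    init_model = build_model()
    init_model.fit(x_all, y_all, batch_size=512, epochs=100)
    init_weights = init_model.get_weights()

    clip_models = []
    for x, y in clips:
        m = build_model()
        m.set_weights(init_weights)      # every clip starts from the same point
        ds = (tf.data.Dataset.from_tensor_slices((x, y))
              .shuffle(len(y)).repeat().batch(64))
        m.fit(ds, steps_per_epoch=500, epochs=1)   # 500 parameter updates
        clip_models.append(m)
    return init_model, clip_models
```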

Training for the initialisation and training on clip 1 are shown in Figure 4. After training for the initialisation, the batch accuracy reaches 0.7 (Figure 4(a)) and the batch loss decreases to around 0.8 (Figure 4(b)). After training on clip 1 with the initialisation, the batch accuracy further increases from 0.7 to 1 (Figure 4(c)), and the batch loss reduces from 0.8 to around 0.1 (Figure 4(d)). With the RNN trained on clip 1, we can generate an identical melodic sequence, as shown in Figure 5.

3.5 Cosine similarity between RNN parameters

The parameter dimensions of an RNN are shown in Table 3. The total number of parameters is 46,757.

We reshape the matrices to vectors, and concatenate the vectors. The concatenated parameters of the initialisation RNN and of the RNNs trained on clip 3 and clip 80 are shown in Figure 6; the differences between the parameters of different RNNs are subtle. The similarity between two clips is indicated by the Cosine similarity between their concatenated RNN parameters. The larger the Cosine similarity, the more similar the clips.

In the data arrangement stage (see Section 3.2), the melody of a clip (30 seconds) is represented as a sequence of pitches over 600 frames (including silent frames), as shown in Figure 2(b). We use the Cosine similarity between two pitch sequences as the baseline similarity.
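A sketch of both similarity computations (names are ours):

```python
import numpy as np

def flatten_params(model):
    """Concatenate all weight matrices of a trained RNN into one vector
    (46,757 values for the architecture in Table 3)."""
    return np.concatenate([w.reshape(-1) for w in model.get_weights()])

def cosine_similarity(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Proposed: similarity between clips a and b from their RNN parameters.
# c_rnn = cosine_similarity(flatten_params(model_a), flatten_params(model_b))
# Baseline: similarity between their 600-frame pitch sequences.
# c_pitch = cosine_similarity(seq_a, seq_b)
```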

(a) Generated pitch sequence.

(b) Original pitch sequence of clip 1.

Figure 5: An identical pitch sequence generated by the trained RNN.

Matrix                            Dimension
cell 0/gru cell/gates/kernel      (101, 128)
cell 0/gru cell/gates/bias        (128)
cell 0/gru cell/candidate/kernel  (101, 64)
cell 0/gru cell/candidate/bias    (64)
cell 1/gru cell/gates/kernel      (128, 128)
cell 1/gru cell/gates/bias        (128)
cell 1/gru cell/candidate/kernel  (128, 64)
cell 1/gru cell/candidate/bias    (64)
fully connected/weights           (64, 37)
fully connected/biases            (37)
all parameters                    46,757

Table 3: Parameter dimensions.

4. RESULTS ANALYSIS

4.1 Evaluation metric and results

In the subjective similarity study, each clip is compared to 4-6 other clips, usually 5 [10]. For example, clip 3 is compared to the clips shown in Table 4(a). We measure the similarity of two clips by computing the Cosine similarity between their RNN parameters, and compare the rank of the votes to the rank of the similarities for evaluation. For example, as shown in Table 4(a), 8 people voted that the melody of clip 80 is similar to that of clip 3, while 7 people voted for the similarity between clip 29 and clip 3. Based on these votes, we assume that clip 80 is more similar to clip 3 than clip 29 is; thus, the Cosine similarity between clip 80 and clip 3 should be larger than that between clip 29 and clip 3: C(80, 3) > C(29, 3). We first convert the similarities and votes into ranks (as shown in Table 4(b)), and then use the pair-wise evaluation metric, Kendall's tau (τ), to compare the ranks. For clip 3, τ is 0.2 based on the similarities between RNN parameters, better than τ = −0.2 based on the similarities between pitch sequences.
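This evaluation can be reproduced directly with the Table 4 numbers; `scipy.stats.kendalltau` ranks its inputs internally, so the raw votes and similarities can be passed as they are:

```python
from scipy.stats import kendalltau

# Clip 3 compared to clips 80, 29, 59, 62 and 5 (values from Table 4(a)).
votes   = [8, 7, 6, 4, 3]
c_rnn   = [0.9975, 0.9973, 0.99717, 0.9976, 0.99725]
c_pitch = [0.6175, 0.7146, 0.6256, 0.7097, 0.6584]

tau_rnn, _ = kendalltau(votes, c_rnn)      # 0.2
tau_pitch, _ = kendalltau(votes, c_pitch)  # -0.2
```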


(a) Parameters of the initialisation RNN

(b) Parameters of the RNN trained on clip 3

(c) Parameters of the RNN trained on clip 80

Figure 6: Parameters of different RNNs, with subtle differences.

The results for the 200 pairs of clips are shown in Table 5. The average τ is 0.125 based on the Cosine similarities between RNN parameters and 0.073 based on those between pitch sequences.⁴ In a preliminary test, we found no improvement in performance from applying a dimensionality-reduction technique, such as Principal Component Analysis (PCA), before computing the Cosine similarity, or from using distances between eigenvectors (weighted by eigenvalues) of the parameter matrices.

4.2 Visualisation

4.2.1 Similarity vs. vote

We assume that if there are more votes on X than on Y when both are compared to A, then X should be more similar to A than Y is. However, this may be too strict when the votes are close (8 on X and 7 on Y, for example). In order to show whether, in general, there is a trend of larger similarity values for pairs of clips with higher votes, we show Cosine similarity vs. vote plots for the RNN parameters and the baseline pitch sequences in Figure 7.

We know that the RNN parameters of different clips are very similar to each other, as shown in Figure 6.

⁴ Using the Euclidean distance provides similar results to using the Cosine similarity: 0.120 and 0.074 for RNN parameters and pitch sequences, respectively.

No.     80      29      59       62      5
Votes   8       7       6        4       3
CRNN    0.9975  0.9973  0.99717  0.9976  0.99725
Cpitch  0.6175  0.7146  0.6256   0.7097  0.6584

(a) Cosine similarities between parameters of clips compared to clip 3.

No.     80  29  59  62  5   τ
RVotes  1   2   3   4   5
RRNN    2   3   5   1   4   0.2
Rpitch  5   1   4   2   3   -0.2

(b) Ranks of Cosine similarities.

Table 4: Evaluation for clip 3. CRNN and Cpitch are the Cosine similarities between parameters and between pitch sequences, respectively.

Similarity   τ
CRNN         0.125
Cpitch       0.073

Table 5: Results.

Therefore, the Cosine similarities between RNN parameters fall in a small range, from 0.995 to 0.999 (Figure 7(a)), whereas the Cosine similarities between melodic sequences fall in a larger range, from 0.4 to 0.9 (Figure 7(b)). However, neither the RNN parameters nor the melodic sequences show a clear trend of the similarity increasing with the number of votes.

4.2.2 t-SNE

To visualise the 80 songs in a low-dimensional space, we first reduce the dimension of the features to 5 by PCA, and then further reduce it to 2 by t-SNE, using the implementation from [20]. The visualisations based on RNN parameters and on pitch sequences are shown in Figure 8. For a clearer visualisation, we only indicate pairs of clips with higher votes (above 9 out of 27, as listed in Table 6), connecting those pairs with lines.
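A sketch of this projection with scikit-learn [20] (`features` holds one row per clip, e.g. the flattened RNN parameters):

```python
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

def embed_2d(features):
    """Reduce the per-clip feature matrix to 5 dimensions with PCA,
    then to 2 dimensions with t-SNE, as in Section 4.2.2."""
    reduced = PCA(n_components=5).fit_transform(features)
    return TSNE(n_components=2).fit_transform(reduced)
```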

Because t-SNE is not a linear projection from the similarity to the distance in the 2-dimensional space, we do not compare the vote against the distance between two clips in the t-SNE visualisation, but focus on the grouping of clips. We observe some interesting groupings of clips in Figure 8(a): the triangle at the top left for (75, 79, 80), and two lines at the bottom right connecting (15, 16) and (6, 16). In Figure 8(b), no such grouping of clips can be observed.

5. DISCUSSIONS AND CONCLUSIONS

From the t-SNE visualisation, we observe some interesting groupings of clips based on RNN parameters (Figure 8(a)). However, the visualisation based on the Cosine similarity between RNN parameters does not show a clear relation between the similarity and the vote (Figure 7(a)).


(a) Visualisation based on RNN parameters.

(b) Visualisation based on pitch sequences.

Figure 7: Similarity vs. vote plots based on different features. [Scatter plots: x-axis, number of votes; y-axis, Cosine similarity between RNN parameters (a) or between pitch sequences (b).]

This may indicate that a direct comparison between RNN parameters is too simple to extract the information contained in such a high-dimensional representation. Figure 6 also illustrates a difficulty of the proposed approach: there are too many parameters, with only subtle differences between them. We would like to dig deeper to understand which parameters are most significant for computing melodic similarity.

Perception studies show that changes in relative scale or relative duration do not have a major impact on melodic similarity [24]. A similarity measure should therefore be invariant to musical transformations, such as transposition in pitch and tempo changes [16, 23]. The proposed generative RNN can model the input pitch sequence, but cannot deal with similarity under such transformations.

No.  Pair      Vote    No.  Pair      Vote    No.  Pair      Vote
1    (79, 80)  23      11   (10, 63)  13      21   (10, 52)  11
2    (47, 68)  19      12   (47, 76)  13      22   (7, 20)   10
3    (65, 78)  18      13   (51, 63)  13      23   (7, 45)   10
4    (6, 16)   17      14   (51, 77)  13      24   (29, 60)  10
5    (12, 47)  16      15   (64, 66)  13      25   (47, 67)  10
6    (12, 63)  16      16   (7, 49)   12      26   (70, 71)  10
7    (15, 16)  16      17   (19, 20)  12      27   (75, 79)  10
8    (67, 75)  16      18   (41, 43)  12      28   (75, 80)  10
9    (54, 63)  15      19   (42, 44)  12
10   (72, 75)  15      20   (68, 72)  12

Table 6: A list of pairs of songs with similarity votes above 9 out of 27.

(a) Visualisation based on RNN parameters.

(b) Visualisation based on pitch sequences.

Figure 8: t-SNE visualisation based on different features. [Scatter plots of the 80 clips in the 2-dimensional t-SNE space; points are labelled with clip numbers.]

In the future, we would like to tackle this problem by training RNNs with coordinate differences instead of absolute coordinates as inputs, i.e. intervals and durations instead of pitches and onsets [16].

We work on melodic similarity based on a performance-based representation of melodies, which seems to complicate the task. We hope to achieve more success on symbolic melody representation by using a score-based representation on a simpler dataset.

In this paper, we have proposed representing a melodic sequence by the parameters of its corresponding generative RNN, and tested the utility of this melodic feature (the RNN parameters) in a melodic similarity task. The proposed feature contains the temporal information within the melodic sequence, and is independent of the length of the sequence. We extend the utility of generative RNNs by using the network for music similarity analysis rather than music generation. We expect that the proposed feature (generative RNN parameters) can also be used in other tasks, such as musicological analysis and music cognition.


6. ACKNOWLEDGEMENT

This work was supported in part by JST ACCEL Grant Number JPMJAC1602, Japan.

7. REFERENCES

[1] Magenta: Melody RNN. https://github.com/tensorflow/magenta/tree/master/magenta/models/melody_rnn. Accessed: 2018-03-30.

[2] D. Bountouridis, D. G. Brown, F. Wiering, and R. C. Veltkamp. Melodic Similarity and Applications Using Biologically-Inspired Techniques. Applied Sciences, 7(12), 2017.

[3] G. Brunner, Y. Wang, R. Wattenhofer, and J. Wiesendanger. JamBot: Music Theory Aware Chord Based Generation of Polyphonic Music with LSTMs. In Proceedings of the 29th International Conference on Tools with Artificial Intelligence (ICTAI), 2017.

[4] F. Colombo, S. P. Muscinelli, A. Seeholzer, J. Brea, and W. Gerstner. Algorithmic Composition of Melodies with Deep Recurrent Neural Networks. Computing Research Repository (CoRR), abs/1606.07251, 2016.

[5] J. S. Downie. Evaluating a Simple Approach to Musical Information Retrieval: Conceiving Melodic N-grams as Text. PhD thesis, University of Western Ontario, 1999.

[6] M. Goto. AIST Annotation for the RWC Music Database. In Proceedings of the 7th International Conference on Music Information Retrieval (ISMIR), pages 359–360, 2006.

[7] M. Goto, H. Hashiguchi, T. Nishimura, and R. Oka. RWC Music Database: Popular, Classical, and Jazz Music Databases. In Proceedings of the 3rd International Conference on Music Information Retrieval (ISMIR), pages 287–288, 2002.

[8] M. Grachten, J.-L. Arcos, and R. L. de Mantaras. Melodic Similarity: Looking for a Good Abstraction Level. In Proceedings of the 5th International Conference on Music Information Retrieval (ISMIR), 2004.

[9] P. Hanna, P. Ferraro, and M. Robine. On Optimizing the Editing Algorithms for Evaluating Similarity Between Monophonic Musical Sequences. Journal of New Music Research, 36(4):267–279, 2007.

[10] S. Kawabuchi, C. Miyajima, N. Kitaoka, and K. Takeda. Subjective Similarity of Music: Data Collection for Individuality Analysis. In Proceedings of the Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), pages 1–5, 2012.

[11] Q. Le and T. Mikolov. Distributed Representations of Sentences and Documents. In Proceedings of the 31st International Conference on Machine Learning (ICML), 2014.

[12] S. Madjiheurem, L. Qu, and C. Walder. Chord2Vec: Learning Musical Chord Embeddings. In Proceedings of the Constructive Machine Learning Workshop at the 30th Conference on Neural Information Processing Systems (NIPS), 2016.

[13] T. Mikolov, K. Chen, G. S. Corrado, and J. Dean. Efficient Estimation of Word Representations in Vector Space. In Proceedings of the International Conference on Learning Representations (ICLR) Workshop, 2013.

[14] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed Representations of Words and Phrases and their Compositionality. In Proceedings of Advances in Neural Information Processing Systems 26 (NIPS), pages 3111–3119, 2013.

[15] D. Mullensiefen and K. Frieler. Optimizing Measures of Melodic Similarity for the Exploration of a Large Folk Song Database. In Proceedings of the 5th International Conference on Music Information Retrieval (ISMIR), pages 1–7, 2004.

[16] D. Mullensiefen and K. Frieler. Evaluating Different Approaches to Measuring the Similarity of Melodies. In V. Batagelj et al., editors, Data Science and Classification, pages 299–306. Springer, Berlin, 2006.

[17] K. S. Orpen and D. Huron. Measurement of Similarity in Music: A Quantitative Approach for Non-parametric Representations. Computers in Music Research, 4:1–44, 1992.

[18] H. Palangi, P. Li, Y. Shen, J. Gao, X. He, J. Chen, X. Song, and R. Ward. Deep Sentence Embedding Using Long Short-Term Memory Networks: Analysis and Application to Information Retrieval. IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), 24(4):694–707, 2016.

[19] M. W. Park and E. C. Lee. Similarity Measurement Method between Two Songs by Using the Conditional Euclidean Distance. WSEAS Transactions on Information Science and Applications, 10(12), 2013.

[20] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-Learn: Machine Learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

[21] R. Socher, C. Y. Lin, A. Y. Ng, and C. D. Manning. Parsing Natural Scenes and Natural Language with Recursive Neural Networks. In Proceedings of the International Conference on Machine Learning (ICML), 2011.

[22] B. L. Sturm, J. F. Santos, and I. Korshunova. Folk Music Style Modelling by Recurrent Neural Networks with Long Short Term Memory Units. In Extended Abstracts for the Late-Breaking Demo Session of the 16th International Society for Music Information Retrieval Conference (ISMIR), 2015.

[23] J. Urbano, J. Llorens, J. Morato, and S. Sanchez-Cuadrado. MIREX 2012 Symbolic Melodic Similarity: Hybrid Sequence Alignment with Geometric Representations. In Music Information Retrieval Evaluation eXchange (MIREX), 2012.

[24] M. R. Velankar, H. V. Sahasrabuddhe, and P. A. Kulkarni. Modeling Melody Similarity Using Music Synthesis and Perception. Procedia Computer Science, 45:728–735, 2015.

[25] V. Velardo, M. Vallati, and S. Jan. Symbolic Melodic Similarity: State of the Art and Future Challenges. Computer Music Journal, 40(2):70–83, 2016.

[26] J. Wu, C. Hu, Y. Wang, X. Hu, and J. Zhu. A Hierarchical Recurrent Neural Network for Symbolic Melody Generation. Computing Research Repository (CoRR), abs/1712.05274, 2017.

[27] S. Yazawa, Y. Hasegawa, K. Kanamori, and M. Hamanaka. Melodic Similarity Based on Extension Implication-Realization Model. In Music Information Retrieval Evaluation eXchange (MIREX), 2013.

[28] Y. Zhu, M. Kankanhalli, and Q. Tian. Similarity Matching of Continuous Melody Contours for Humming Querying of Melody Databases. In Proceedings of the IEEE Workshop on Multimedia Signal Processing, 2002.
