MAESTRO : Deep-Learning Approaches to Melody Reduction

Cheol-gyu Jin
Electrical Engineering, KAIST
[email protected]

Gyu-young Kwauk
Electrical Engineering, KAIST
[email protected]

ABSTRACT

We are exposed to a great deal of music, and as a result many people want to learn to play it. However, mastering an instrument takes a long time, and many people give up along the way. We want to provide a sense of accomplishment by offering an easy musical score for the song a learner wants to play, so that more people keep learning. We used the GTTM dataset to build a ground-truth dataset of easy scores, and we propose an LSTM model that produces an easy score.

1. INTRODUCTION

Opportunities to play music are easy to find in daily life, and most people hope to learn to play. But it can take a long time before we can play the song we want. Take the piano as an example: when we start learning the piano, we practice many pieces, from Beyer to Czerny, but these practice pieces are songs we do not know. That is why most people lose interest while learning to play. We see this as a problem, and we believe learners should be given small achievements along the way; in other words, playing the song you actually want provides motivation. Our goal is to give beginners an easy musical score for the songs they want to learn. The purpose of this paper is to reduce the number of people who give up while learning to play, and to make it easier for more people to learn.

2. RELATED WORK

2.1 GTTM Dataset

GTTM is composed of four modules, each of which assigns a separate structural description to a listener's understanding of a piece of music. These four modules respectively output a grouping structure, a metrical structure, a time-span tree, and a prolongational tree. The grouping structure formalizes the intuition that tonal music is organized into groups that are in turn composed of subgroups. These groups are presented graphically as several levels of arcs below a musical staff. The metrical structure describes the rhythmic hierarchy of the piece by identifying the positions of strong beats at the levels of a quarter note, a half note, a measure, two measures, four measures, and so on. Strong beats are illustrated as several levels of dots below the staff. A time-span tree is a binary tree: a hierarchical structure describing the relative structural importance of notes, differentiating the essential parts of the melody from the ornamentation.

Figure 1. GTTM Analyzer

2.2 Melody Reduction

The time-span reduction represents an intuitive idea: if we remove ornamental notes from a long melody, we obtain a simpler, similar-sounding melody. An entire piece of tonal music can eventually be reduced to one important note.

Figure 2. Some reduction rules

2.3 Recurrent Neural Network (RNN)

A Recurrent Neural Network is a neural network for sequential data. RNNs are called 'recurrent' because they perform the same task for every element of a sequence, with the output depending on the previous computations. Another way to think about RNNs is that they have a "memory" that captures information about what has been computed so far.


Figure 3. RNN architecture

The figure above shows the RNN architecture. $x_t$ is the input at time step $t$, and $s_t$ is the hidden state at time step $t$, which is the 'memory' of the network. $s_t$ is calculated from the previous hidden state and the input at the current step, as in Eqn. (1):

$$s_t = f(U x_t + W s_{t-1}) \quad (1)$$

$o_t$ is the output at step $t$. The function $f$ is an activation function such as ReLU or tanh.

RNNs have shown success in many problems, such as speech recognition and language modeling.
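As a concrete illustration, the following is a minimal NumPy sketch of the recurrence in Eqn. (1). The dimensions, the random toy inputs, and the bias term are illustrative assumptions, not details of our model.

```python
import numpy as np

def rnn_step(x_t, s_prev, U, W, b):
    """One recurrence step: s_t = tanh(U @ x_t + W @ s_prev + b)."""
    return np.tanh(U @ x_t + W @ s_prev + b)

# Toy dimensions (hypothetical): 9-dimensional inputs, 16-dimensional hidden state.
input_dim, hidden_dim = 9, 16
rng = np.random.default_rng(0)
U = 0.1 * rng.standard_normal((hidden_dim, input_dim))
W = 0.1 * rng.standard_normal((hidden_dim, hidden_dim))
b = np.zeros(hidden_dim)

s = np.zeros(hidden_dim)                       # initial hidden state s_0
for x in rng.standard_normal((5, input_dim)):  # a 5-step toy input sequence
    s = rnn_step(x, s, U, W, b)                # the same weights at every step
```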

2.4 Long Short-Term Memory (LSTM) Network

A Long Short-Term Memory network is a special kind of RNN capable of learning long-term dependencies. LSTMs are explicitly designed to avoid the long-term dependency problem.

Figure 4. LSTM architecture

In standard RNNs, the repeating module has a very simple structure, such as a single tanh layer. LSTMs also have this chain-like structure, but the repeating module is different: instead of a single neural network layer, there are four, interacting with each other.
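For reference, here is a minimal PyTorch sketch of running an LSTM over a sequence; all sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=9, hidden_size=32, batch_first=True)

x = torch.randn(1, 20, 9)             # (batch, sequence length, features)
h0 = torch.zeros(1, 1, 32)            # initial hidden state
c0 = torch.zeros(1, 1, 32)            # initial cell state, the LSTM's extra memory
output, (hn, cn) = lstm(x, (h0, c0))  # output holds the hidden state at every step
```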

3. METHOD

3.1 Data Preparation

3.1.1 MusicXML

The attributes element contains key information needed to interpret the notes and musical data that follow in this part.

Each note in MusicXML has a duration element. The divisions element provides the unit of measure for the duration element, in divisions per quarter note. Since all we have in this example is one whole note, we never have to divide a quarter note, so we set the divisions value to 1.

Musical durations are typically expressed as fractions, such as "quarter" and "eighth" notes. MusicXML durations are fractions, too. Since the denominator rarely needs to change, it is represented separately in the divisions element, so that only the numerator needs to be associated with each individual note. This is similar to the scheme used in MIDI to represent note durations.

The key element is used to represent a key signature. Here we are in the key of C major, with no flats or sharps, so the fifths element is 0. If we were in the key of D major with 2 sharps, fifths would be set to 2. If we were in the key of F major with 1 flat, fifths would be set to -1. The name "fifths" comes from the representation of a key signature along the circle of fifths. It lets us represent standard key signatures with one element, instead of separate elements for sharps and flats.

The time element represents a time signature. Its two component elements, beats and beat-type, are the numerator and denominator of the time signature, respectively.

The pitch element must have a step and an octave element. Optionally, it can have an alter element if a flat or sharp is involved. These elements represent sound, so the alter element must always be included if used, even if the alteration is in the key signature. In this case we have no alteration. The pitch step is C, and the octave of 4 indicates the octave that starts with middle C; thus this note is a middle C. After understanding the structure of MusicXML, we implemented a parser in Python. We decided to use 9 features for each note.
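Below is a minimal sketch of such a parser using Python's standard xml.etree.ElementTree, applied to a tiny MusicXML fragment containing the elements described above. Our nine note features are not enumerated in this paper, so the fields extracted here (step, octave, alter, duration) are only an illustrative subset.

```python
import xml.etree.ElementTree as ET

# A minimal MusicXML fragment: C major, 4/4, one middle-C whole note.
XML = """<score-partwise><part id="P1"><measure number="1">
  <attributes>
    <divisions>1</divisions>
    <key><fifths>0</fifths></key>
    <time><beats>4</beats><beat-type>4</beat-type></time>
  </attributes>
  <note>
    <pitch><step>C</step><octave>4</octave></pitch>
    <duration>4</duration>
  </note>
</measure></part></score-partwise>"""

root = ET.fromstring(XML)
for note in root.iter("note"):
    pitch = note.find("pitch")
    step = pitch.findtext("step")                      # e.g. "C"
    octave = int(pitch.findtext("octave"))             # octave 4 starts at middle C
    alter = int(pitch.findtext("alter", default="0"))  # sharps/flats, 0 if absent
    duration = int(note.findtext("duration"))          # in divisions per quarter note
    print(step, octave, alter, duration)
```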

3.2 Many-to-many classification

Figure 5. Model logic diagram

The figure above (Fig. 5) shows our entire logic diagram. Using the method described in Section 3.1, we feed the MusicXML file to the LSTM as input. The LSTM model classifies each note as deleted or kept: when the classifier outputs 0 for a note, that note is deleted in the reduced melody. Because the LSTM model must classify every note, we select a many-to-many LSTM classification model.

Figure 6. Proposed LSTM architecture

We use the results of a rule-based melody reduction algorithm as the ground truth.
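A minimal PyTorch sketch of such a many-to-many classifier follows. The feature dimension, hidden size, and single-layer choice are assumptions; the paper does not fix these hyperparameters.

```python
import torch
import torch.nn as nn

class MelodyReducer(nn.Module):
    """Many-to-many LSTM classifier: one keep/delete label per note (sketch)."""

    def __init__(self, input_dim=9, hidden_dim=64):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, 2)   # 2 classes: 0 = delete, 1 = keep

    def forward(self, x, state=None):
        out, state = self.lstm(x, state)     # out: (batch, num_notes, hidden_dim)
        return self.fc(out), state           # logits: (batch, num_notes, 2)
```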

3.3 Generation of reduced song

Given the labels, we generate the reduced sheet in MusicXML format. Deleted notes are depicted below (Figure 7): the remaining notes keep the same state as the input notes, and the deleted notes act as rests.

Figure 7. Deleted nodes
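A sketch of this conversion, assuming the score is already parsed with xml.etree.ElementTree and that the labels arrive in document order (both assumptions):

```python
import xml.etree.ElementTree as ET

def reduce_score(root, labels):
    """Turn each note labelled 0 into a rest of the same duration (sketch)."""
    for note, keep in zip(root.iter("note"), labels):
        if keep == 0:
            pitch = note.find("pitch")
            if pitch is not None:
                note.remove(pitch)                  # drop the pitch element...
                note.insert(0, ET.Element("rest"))  # ...and mark the note as a rest
```

Keeping the duration element untouched means the reduced score stays metrically aligned with the original.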

4. TRAINING

4.1 Training and Loss Function

Our training algorithm is as follows. For each batch, we initialize the hidden state so that results are computed per song. Then, for each step (note), we feed the hidden state and the input data into the LSTM, compute the loss at each step, and sum it over all steps in the batch. Finally, we update the model parameters to minimize the loss using the Adam optimizer.

Figure 8. Cross-Entropy Loss
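A condensed PyTorch sketch of this loop, with the cross-entropy loss of Figure 8 and the Adam optimizer; the toy data shapes and the learning rate are hypothetical stand-ins.

```python
import torch
import torch.nn as nn

model = MelodyReducer()                # the sketch from Section 3.2
criterion = nn.CrossEntropyLoss()      # the loss shown in Figure 8
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Toy batch standing in for GTTM-derived songs: 4 songs, 30 notes, 9 features.
features = torch.randn(4, 30, 9)
labels = torch.randint(0, 2, (4, 30))  # keep/delete label per note

for epoch in range(10):
    optimizer.zero_grad()
    logits, _ = model(features)        # state=None: fresh zero hidden state per song
    # CrossEntropyLoss expects (N, C) logits, so flatten the note dimension.
    loss = criterion(logits.reshape(-1, 2), labels.reshape(-1))
    loss.backward()
    optimizer.step()
```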

4.2 Classification

The outputs of the LSTM are a hidden state and a prediction. The prediction is computed by a fully connected network followed by a softmax function. The prediction is a [batch_size, 2] vector, from which we can predict the input data's label.
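In code, this step is just a softmax over the two logits followed by an argmax (a sketch, reusing `logits` from the training sketch above):

```python
import torch

probs = torch.softmax(logits, dim=-1)  # (batch, num_notes, 2) class probabilities
pred = probs.argmax(dim=-1)            # 0 = delete the note, 1 = keep it
```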

5. EVALUATION

5.1 Loss and accuracy

We plot the loss and accuracy with TensorBoard. The x-axis is the epoch number and the y-axis is the target value. Our training set is 80 songs from the GTTM dataset, and our test set is 20 songs. Accuracy is computed by comparing the labels of individual notes. To avoid overfitting, we select the model with the best results on the test set.
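For reference, the per-note accuracy can be computed in one line (a sketch, reusing the `pred` and `labels` tensors from the earlier sketches):

```python
accuracy = (pred == labels).float().mean().item()  # fraction of correctly classified notes
```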

Figure 9. Train loss and accuracy

Figure 10. Test loss and accuracy

5.2 Output

Here are some results of the easy-sheet generation. We take an input, run our model to classify each note, and generate an easy music sheet.

Figure 11. Wiegenlied: original (top), reduced by baseline (middle), reduced by our model (bottom)


Figure 12. Moments Musicaux: original (top), reduced by baseline (middle), reduced by our model (bottom)

Figure 13. Gianni Schicchi, "O mio babbino caro": original (top), reduced by baseline (middle), reduced by our model (bottom)

6. CONCLUSION

We verified that our model is well trained: it behaves very similarly to the baseline algorithm. Consequently, our results produce a well-reduced melody whenever the baseline algorithm does, which means our LSTM learned the reduction rules well. From this, we expect our model would also work well with a large amount of human-performed data. Our model has some limitations, which we discuss in Future Works.

7. FUTURE WORKS

7.1 Melody reduction by level

Some results show over-reduction, meaning the reduction removes so much that the reduced song loses its identity. To deal with this problem, we can introduce a "level" for reduction: the model suggests multiple reduced songs with different reduction levels. We could then provide level-by-level music sheets, which would be very helpful for beginners.

Figure 14. Over-reduced song

7.2 Unsupervised learning

Our model works like the rule-based algorithm, so it sometimes follows the rules faithfully yet produces results that are not reasonable from the viewpoint of human cognition. We think it would work better with unsupervised learning (such as clustering) that learns human cognition from the waveform of the music.

8. AUTHOR CONTRIBUTIONS

Gyu-young implemented the LSTM model and the training algorithms. Cheol-gyu collected the dataset and implemented the MusicXML reader and the easy-sheet generator.
