
Naive Dictionary On Musical Corpora: From Knowledge Representation To Pattern Recognition

Qiuyi Wu1,*, Ernest Fokoué1

1 School of Mathematical Science, Rochester Institute of Technology, Rochester, New York, USA

* [email protected]

Abstract

In this paper, we propose and develop the novel idea of treating musical sheets as literary documents in the traditional text analytics parlance, to fully benefit from the vast amount of research already existing in statistical text mining and topic modelling. We specifically introduce the idea of representing any given piece of music as a collection of "musical words" that we codenamed "muselets", which are essentially musical words of various lengths. Given the novelty and therefore the extreme difficulty of properly forming a complete version of a dictionary of muselets, the present paper focuses on a simpler albeit naive version of the ultimate dictionary, which we refer to as a Naive Dictionary because all its words are of the same length. We specifically construct herein a naive dictionary featuring a corpus made up of African American, Chinese, Japanese and Arabic music, on which we perform both topic modelling and pattern recognition. Although some of the results based on the Naive Dictionary are reasonably good, we anticipate phenomenal predictive performances once we get around to actually building a full-scale complete version of our intended dictionary of muselets.

1 Introduction

Music and text are similar in the way that both of them can be regarded as information carriers and emotion deliverers. People get daily information from reading newspapers, magazines, blogs, etc., and they can also write diaries or personal journals to reflect on daily life, let out pent-up emotions, and record ideas and experiences. Composers express their feelings through music with different combinations of notes, diverse tempos¹, and dynamics levels², as another version of language.

This paper explores various aspects of statistical machine learning methods for music mining, with a concentration on music pieces from Jazz legends like Charlie Parker and Miles Davis. We attempt to create a Naive Dictionary analogous to the language lexicon. That is to say, when people hear a music piece, they are hearing the audio of an essay written with "musical words", or "muselets". The target of this research work is to create a homomorphism between music and literature. Instead of decomposing a music sheet into a collection of single notes, we attempt to employ direct seamless

¹ In musical terminology, tempo ("time" in Italian) is the speed or pace of a given piece.
² In music, dynamics means how loud or quiet the music is.


adaptation of canonical topic modeling on words in order to "topic model" music fragments.

One of the most challenging components is to define the basic unit of information from which one can formulate a soundtrack as a document. Specifically, if a music soundtrack were to be viewed as a document made up of sentences and phrases, with sentences defined as a collection of words (adjectives, verbs, adverbs and pronouns), several topics would be fascinating to explore:

• What would be the grammatical structure in music?

• What would constitute the jazz lexicon or dictionary from which words are drawn?

We assume that all music is storytelling. It is plausible to imagine every piece of music as a collection of words and phrases of variable lengths, with adverbs and adjectives and nouns and pronouns.

ϕ : musical sheet → bag of music words

The construction of the mapping ϕ is non-trivial and requires a deep understanding of music theory. Here several great musicians offer insights on the complexity of ϕ from their perspectives, to explain the representation of the input space, namely, creating a mapping from a music sheet to a collection of music "words" or "phrases":

• "These are extremely profound questions that you are asking here. I think I’m interested in trying. Butyou have opened up a whole lot of bigger questions with this than you could possibly imagine." (Dr.Jonathan Kruger, personal communication with Dr. Ernest Fokoue, November 24, 2018).

• "Your music idea is fabulous but are you sure that nothing exists? Do you know "band in a box? Itis a software in which you put a sequence of chords and you get an improvisation ’à la manière de’.You choose amongst many musicians so they probably have the dictionary to play as Miles, Coltrane,Herbie, etc." (Dr. Evans Gouno, personal communication with Dr. Ernest Fokoue, November05, 2018).

• Rebecca Ann Finnangan Kemp mentioned the building blocks of music when discussing the idea of musical words (personal communication with Dr. Ernest Fokoue, November 20, 2018).

The concept of notes is equivalent to that of the alphabet, which can be extended as below:

• literature word ≡ mixture of the 26 letters of the alphabet

• music word ≡ mixture of the 12 musical notes

Since notes are fundamental, one can reasonably consider an input space directly isomorphic to the 12 notes.

2 Related Work

Table 1. Comparison between Text and Music in Topic Modeling

Text:  letter | word   | topic  | document | corpus
Music: note   | notes* | melody | song     | album

* a series of notes in one bar can be regarded as a "word"


Figure 1. Piece of Music Melody

Compared with the role of text in topic modeling as shown in Table 1, we treat a series of notes as a "word" (which can also be called a "term"), since a single note cannot hold enough information for us to interpret; specifically, we treat the notes in one bar³ as one "term". Melody⁴ plays the role of "topic", and the melodic materials give the shape and personality of the music piece. "Melody" is also referred to as "key-profile" by Hu and Saul [2009a] in their paper, and this concept was based on the key-finding algorithm from Krumhansl and Schmuckler [1990] and the empirical work from Krumhansl and Kessler [1982]. The whole song is regarded as a "document" in text mining, and a collection of songs, called an album in music, can be regarded as a "corpus" in text mining.

Figure 2. Circle of Fifths (left) and Key-profiles (right)

Specifically, "key-profile" is chromatic scale showed geometrically in Figure 2 Circle of Fifths plotcontaining 12 pitch classes in total with major key and minor key respectively, thus there are to-tally 24 key-profiles, each of which is a 12-dimensional vector. The vector in the earliest model inLonguet-Higgins and Steedman [1971] uses indicator with value of 0 and 1 to simply determine thekey of a monophonic piece. E.g. C major key-profile:

[1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1]

As shown in the figures below, Krumhansl and Schmuckler [1990] judge the key in a more robust way: the elements in the vector indicate the stability of each pitch class corresponding to each key. A small computational sketch follows the footnotes below.

³ In musical notation, a bar (or measure) is a segment of time corresponding to a specific number of beats, in which each beat is represented by a particular note value and the boundaries of the bar are indicated by vertical bar lines.

⁴ Melody is formed by consecutive notes so that the listener automatically perceives those notes as a connected series of notes.
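To make the correlational key-finding idea concrete, here is a minimal Python sketch, assuming the published Krumhansl-Kessler C major ratings and restricting attention to the 12 major keys; the piece's pitch-class counts are illustrative toy data, not from the paper.

```python
# Minimal sketch of Krumhansl-Schmuckler-style key finding (major keys only).
# The profile values are the published Krumhansl-Kessler C major ratings;
# a full implementation would also include the 12 minor-key profiles.
import numpy as np

MAJOR_PROFILE = np.array([6.35, 2.23, 3.48, 2.33, 4.38, 4.09,
                          2.52, 5.19, 2.39, 3.66, 2.29, 2.88])

def key_scores(pc_counts):
    """Correlate an observed pitch-class distribution with all 12 major keys."""
    return {tonic: float(np.corrcoef(pc_counts,
                                     np.roll(MAJOR_PROFILE, tonic))[0, 1])
            for tonic in range(12)}

piece = np.array([10, 0, 6, 0, 8, 5, 0, 9, 0, 4, 0, 3])  # toy C-major-ish counts
scores = key_scores(piece)
print(max(scores, key=scores.get))  # expected tonic pitch class: 0, i.e. C
```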


Melodies in the same key-profile would have a similar set of notes, and each key-profile is a distribution over notes.

Figure 3 shows the pitch-class distribution of the C major Piano Sonata No. 1, K. 279/189d (Mozart, Wolfgang Amadeus) using the K-S key-finding algorithm, and we can see that all natural notes (C, D, E, F, G, A, B) have a higher probability of occurring than the other notes. Figure 4 shows the pitch-class distribution of BWV 773, No. 2 in C minor (Bach, Johann Sebastian), and again we can see the specific notes typical of C minor with higher probability: C, D, D♯, F, G, G♯, and A♯.

Figure 3. C major key-profile
Figure 4. C minor key-profile

Usually, different scales evoke different emotions. Generally, major scales arouse buoyant and upbeat feelings while minor scales create a dismal and dim atmosphere. Details on the emotional and mood effects of musical keys are presented in a later section.

3 Representation

We mainly studied symbolic music in the MXL format in this research work. The data are collected from MuseScore², containing music pieces from different musicians and genres. Specifically, we collect music pieces from 3 different music genres, i.e. Chinese songs, Japanese songs, and Arabic songs. For Jazz music we collect work from 7 different musicians, i.e. Duke Ellington, Miles Davis, John Coltrane, Charlie Parker, Louis Armstrong, Bill Evans, and Thelonious Monk.

• Transfer mxl file to xml file

• Use mxl files to extract notes in each measure

• Create matrices based on the extracted notes

² MuseScore: https://musescore.org/en


Figure 5. Transforming Notes from Music Sheets to Matrices

Based on the concept of duration (the length of time a pitch/tone is sounded), and since the total duration in each measure is fixed, we can create Measure-Note matrices. In Measure-Note matrices, we use the letters C, D, E, F, G, A, B to denote the notes from "Do" to "Si", "flat" and "sharp" to denote ♭ and ♯, and "O" to denote the rest³. A minimal extraction sketch is given below.
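The following sketch assumes the music21 toolkit as one possible parser (the paper does not name the tool it used), and "song.mxl" is a hypothetical file name.

```python
# Sketch: extract a Measure-Note matrix from an MXL file with music21.
from music21 import converter, note, chord

score = converter.parse("song.mxl")
measure_note_rows = []
for part in score.parts:
    for measure in part.getElementsByClass("Measure"):
        row = []
        for element in measure.notesAndRests:
            if isinstance(element, note.Rest):
                row.append("O")                 # "O" denotes a rest
            elif isinstance(element, chord.Chord):
                row.extend(p.name for p in element.pitches)  # e.g. "B-" = B-flat
            else:
                row.append(element.pitch.name)  # e.g. "F#", "C"
        measure_note_rows.append(row)

print(measure_note_rows[:3])  # one row of note symbols per measure
```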

As demonstrated above, for the Jazz part we mainly studied work from 7 Jazz musicians (Duke Ellington, Miles Davis, John Coltrane, Charlie Parker, Louis Armstrong, Bill Evans, Thelonious Monk), and for the comparison with other music genres we focused on Chinese, Japanese, and Arabic music. So we created two different albums based on the Measure-Note matrices generated in the previous step. We use two different ways to represent the albums.

3.1 Note-Based Representation

Figure 6. Music Key

Based on the 12 keys (5 black keys + 7 white keys) in Figure 6, we make a note-based representation according to the pitch classes in Table 2: forsaking the order of notes, we describe each measure in the song as a 12-dimensional binary vector X = [x_1, x_2, ..., x_12], where x_i ∈ {0, 1} (Table 3). A small sketch of this encoding follows the footnote below.

³ A rest is an interval of silence in a piece of music.
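A small illustrative sketch of the binary encoding; PITCH_CLASS_INDEX is a hypothetical helper following Table 2, written with "#" for sharp and "-" for flat in music21's naming style.

```python
# Sketch: encode one measure as a 12-dimensional binary pitch-class vector.
PITCH_CLASS_INDEX = {
    "C": 0, "B#": 0, "C#": 1, "D-": 1, "D": 2, "D#": 3, "E-": 3,
    "E": 4, "F-": 4, "F": 5, "E#": 5, "F#": 6, "G-": 6, "G": 7,
    "G#": 8, "A-": 8, "A": 9, "A#": 10, "B-": 10, "B": 11, "C-": 11,
}

def measure_to_binary_vector(measure_notes):
    """Map a list of note names (rests as 'O') to a 0/1 vector of length 12."""
    vec = [0] * 12
    for name in measure_notes:
        if name != "O":              # rests carry no pitch information
            vec[PITCH_CLASS_INDEX[name]] = 1
    return vec

print(measure_to_binary_vector(["C", "E", "G", "O"]))
# -> [1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0]
```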


Table 2. Pitch Class

Pitch Class | Tonal Counterparts | Solfege
     1      | C, B♯              | do
     2      | C♯, D♭             |
     3      | D                  | re
     4      | D♯, E♭             |
     5      | E, F♭              | mi
     6      | F, E♯              | fa
     7      | F♯, G♭             |
     8      | G                  | sol
     9      | G♯, A♭             |
    10      | A                  | la
    11      | A♯, B♭             |
    12      | B, C♭              | ti

Table 3. Notes collection from 4 Music Genres

Document | Pitch Class             | Genre
China 1  | 0 0 0 0 1 0 1 0 0 0 0 1 | China
China 2  | 0 0 0 0 1 0 1 0 0 0 0 0 | China
China 3  | 0 0 0 0 0 0 1 0 0 0 0 1 | China
...      | ...                     | ...
China 7  | 0 1 0 0 1 0 1 0 0 0 0 1 | China
China 8  | 0 0 0 0 1 0 1 0 0 0 0 1 | China
...      | ...                     | ...
Japan 1  | 1 0 1 1 0 0 1 0 0 0 0 0 | Japan
Japan 2  | 1 0 0 0 0 0 0 1 0 0 0 0 | Japan
...      | ...                     | ...

• Document: song names, tantamount to documents in text mining

• Pitch Class: a binary vector whose elements indicate whether a certain note is on, tantamount to words in text mining

• Genre: labels for the Chinese, Japanese and Arabic songs, to be compared with Jazz songs later

• The dimension of this data frame is 1469 × 3

We create the document-term matrix (DTM) whose cells reflect the frequency of terms in each document. The rows of the DTM represent documents and the columns represent terms in the corpus; A_{i,j} contains the number of times term j appeared in document i. A sketch of this construction is given below.
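The sketch below assumes each measure has already been encoded as a 12-bit pitch-class string, as in Table 3; the two-song corpus and scikit-learn's CountVectorizer are illustrative choices, not the authors' pipeline.

```python
# Sketch: build the document-term matrix with scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer

docs = {
    "China 1": "000010100001 000010100000 000000100001",
    "Japan 1": "101100100000 100000010000",
}
vectorizer = CountVectorizer(token_pattern=r"[01]{12}")  # keep 12-bit tokens intact
dtm = vectorizer.fit_transform(docs.values())            # rows: documents, cols: terms

print(vectorizer.get_feature_names_out())
print(dtm.toarray())  # A[i, j] = number of times term j appears in document i
```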


Table 4. Document Term Matrix

Document | 000000000000 | 000000000100 | 000000010100 | ...
Arab 5   | 15           | 6            | 20           | ...
Arab 7   | 0            | 5            | 5            | ...
China 6  | 1            | 12           | 0            | ...
China 7  | 13           | 0            | 1            | ...
Japan 4  | 8            | 4            | 1            | ...
Japan 5  | 0            | 0            | 0            | ...
USA 4    | 2            | 1            | 0            | ...
...      | ...          | ...          | ...          | ...

3.2 Measure-Based Representation

Table 5. Notes collection from 7 musicians

Document  | Notes                | Musician
Charlie 1 | B♭ O O O O O O O     | Charlie
Charlie 1 | B B♭ A A♭ G G G♭ F   | Charlie
Charlie 1 | E F G♭ B♭ G G A♭ O   | Charlie
...       | ...                  | ...
Charlie 7 | E E E E G G C O      | Charlie
Charlie 8 | F♯ O O O O O O O     | Charlie
...       | ...                  | ...
Duke 1    | C C C G G G G G      | Duke
Duke 1    | F F F A♭ A♭ A♭ B♭ B♭ | Duke
...       | ...                  | ...

• Document: song names, tantamount to document in text mining

• Notes: a series of notes in one measure, tantamount to word in text mining

• Musician: the composer, tantamount to the label for later analysis

• The dimension of this data frame is 5149 × 3

We again create the document-term matrix (DTM) whose cells reflect the frequency of terms in each document; the rows of the DTM represent documents, the columns represent terms in the corpus, and A_{i,j} contains the number of times term j appeared in document i. The dimension of the DTM is 83 × 2960, with the last column as the label: Duke, Miles, John, Charlie, Louis, Bill, Monk.


Table 6. Document Term Matrix

Document  | O O O O O O O O | B D B B D D E E | C A A♯ B D C A O | ...
Miles 6   | 40              | 0               | 0                | ...
Louis 2   | 32              | 0               | 0                | ...
Sonny 3   | 26              | 0               | 0                | ...
Miles 2   | 25              | 0               | 0                | ...
Duke 4    | 0               | 9               | 0                | ...
Sonny 4   | 14              | 0               | 0                | ...
Charlie 9 | 0               | 0               | 8                | ...
...       | ...             | ...             | ...              | ...

We can also take a close look at the most frequent terms in the whole album, i.e. the terms that appear more than 20 times:

Table 7. Most Frequent Terms

Term
O O O O O O O O
C C C C C C C C
A A A A O O O O
B♭ B♭ B♭ B♭ B♭ B♭ B♭ B♭
B B B B B B B B
D D D D D D D D
G G G G G G G G
A A A A A A A A

4 Pattern Recognition

We take the topic proportion matrix as input and feed it to machine learning techniques for classification. We conduct the supervised analysis via 5 models with k-fold cross-validation:

• K Nearest Neighbors

• Multi-class Support Vector Machine

• Random Forest

• Neural Networks with PCA Analysis

• Penalized Discriminant Analysis


Algorithm 1 Supervised Analysis: 10-fold cross-validation with 3 times resampling

for i ← 1 : 3 do
    for j ← 1 : 10 do
        Split the dataset D = {z_l, l = 1, 2, ..., n} into K chunks so that n = K × m
        Form the subset V_j = {z_l ∈ D : l ∈ [1 + (j − 1) × m, j × m]}
        Extract the train set T_j := D \ V_j
        Build the estimator g^(j)(·) using T_j
        Compute the predictions g^(j)(x_l) for z_l ∈ V_j
        Calculate the error ε_j = (1/m) ∑_{z_l ∈ V_j} ℓ(y_l, g^(j)(x_l))
    end for
    Compute CV(g) = (1/K) ∑_{j=1}^{K} ε_j
    Find g*(·) = argmin_{j=1:J} CV(g(·)), the model with the lowest prediction error
end for
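A sketch of this resampled cross-validation, assuming X and y are toy stand-ins for the topic-proportion matrix and the class labels; any of the five models can replace the classifier here.

```python
# Sketch of Algorithm 1: 10-fold cross-validation repeated 3 times.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedKFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.random((83, 20))           # e.g. 83 songs x 20 topic proportions
y = rng.integers(0, 7, size=83)    # 7 musician labels (toy stand-in)

cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=0)
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=cv)
print(f"cross-validated error rate: {1 - scores.mean():.3f}")
```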

4.1 K-Nearest Neighbors

kNN predicts the class of a song by finding the k most similar songs, where similarity is measured by the Euclidean distance between two song vectors in this case. The class (label) here is one of the 7 musicians: Duke, Miles, John, Charlie, Louis, Bill, Monk. A usage sketch follows Algorithm 2.

Algorithm 2 k-Nearest Neighbors

Choose the value of k for D = {(x_1, Y_1), ..., (x_i, Y_i), ..., (x_n, Y_n)}, Y_i ∈ {1, ..., g}
Let x* be a new point
for i ← 1 : n do
    Compute d_i* = d(x*, x_i)
end for
Rank all the distances d_i* in order: d_(1)* ≤ d_(2)* ≤ ... ≤ d_(k)* ≤ ... ≤ d_(n)*
Form V_k(x*) = {x_i : d(x*, x_i) ≤ d_(k)*}
Predict the response Y*_kNN = most frequent label in V_k(x*) = argmax_{j∈{1,...,g}} p_j^(k)(x*),
where p_j^(k)(x*) = (1/k) ∑_{x_i ∈ V_k(x*)} I(Y_i = j)
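A usage sketch on toy stand-in features:

```python
# Sketch: kNN on topic-proportion features.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(1)
X = rng.random((83, 20))
y = rng.integers(0, 7, size=83)    # the 7 musician labels (toy stand-in)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
knn.fit(X_train, y_train)
print("test error rate:", 1 - knn.score(X_test, y_test))
```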

4.2 Support Vector Machine

The task of the Support Vector Machine (SVM) is to find the optimal hyperplane that separates the observations in such a way that the margin is as large as possible; that is to say, the distance to the nearest sample patterns (the support vectors) should be as large as possible. The SVM was originally designed as a binary classifier, so in this case, where there are more than two classes, we use a multi-class SVM. Specifically, we transform the single multi-class task into multiple binary classification tasks: we train K binary SVMs and maximize the margins from each class to the remaining ones. We choose the linear kernel (Eq. 1) due to its excellent performance on high-dimensional data that are very sparse, as in text mining. A usage sketch follows Algorithm 3.

K(x_i, x_j) = ⟨x_i, x_j⟩ = x_i^T x_j    (1)


Algorithm 3 Multi-class Support Vector Machine

for k ← 1 : K do
    Given D = {(x_1, Y_1k), ..., (x_i, Y_ik), ..., (x_n, Y_nk)}, Y_ik ∈ {+1, −1}
    Find the function h(x) = w^T x + b that achieves
        max_{w,b} [ min_{Y_ik=+1} (|w^T x_i + b| / ||w||) + min_{Y_ik=−1} (|w^T x_i + b| / ||w||) ]
        = max_{w,b} 2/||w|| = min_{w,b} (1/2)||w||²
    subject to Y_ik (w^T x_i + b) ≥ 1, ∀ i = 1, 2, ..., n
end for
Get argmax_{k=1,...,K} f_k(x) = argmax_{k=1,...,K} (w_k^T x + b_k)
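A one-vs-rest sketch on toy stand-in data; scikit-learn's LinearSVC trains the K binary classifiers internally, mirroring Algorithm 3.

```python
# Sketch: multi-class linear SVM via one-vs-rest.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(2)
X = rng.random((83, 20))
y = rng.integers(0, 7, size=83)    # toy stand-in labels

svm = LinearSVC(C=1.0)             # linear kernel K(xi, xj) = xi . xj
svm.fit(X, y)
print("training accuracy:", svm.score(X, y))
```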

4.3 Random Forest

Random Forest (RF) is an ensemble learning method that improves on the performance of a single tree. Compared with tree bagging, the only difference in a random forest is that each tree candidate is grown on a random subset of the features, called "feature bagging", to correct the overfitting issue of trees: if some features weigh much more strongly than others, these features would otherwise be selected in many of the B trees across the whole forest. A usage sketch follows Algorithm 4.

Algorithm 4 Random Forest

for b ← 1 : B do
    Draw with replacement from D a sample D^(b) = {z_1^(b), ..., z_n^(b)}
    Draw a subset {i_1^(b), ..., i_d^(b)} of d variables without replacement from {1, 2, ..., p}
    Prune the unselected variables from the sample D^(b) so that D_sub^(b) is d-dimensional
    Build the tree (base learner) g^(b) based on D_sub^(b)
end for
Output the result based on the mode of the classes: g_RF(x) = argmax_{j∈{1,...,g}} p_j(x),
where p_j(x) = (1/B) ∑_{b=1}^{B} I(g^(b)(x) = j)
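A sketch with feature bagging controlled by max_features, on toy stand-in data:

```python
# Sketch: random forest with feature bagging.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(3)
X = rng.random((83, 20))
y = rng.integers(0, 7, size=83)    # toy stand-in labels

rf = RandomForestClassifier(n_estimators=500,     # B trees
                            max_features="sqrt",  # d random features per split
                            oob_score=True, random_state=0)
rf.fit(X, y)
print("out-of-bag accuracy:", rf.oob_score_)
```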

4.4 Neural Network with PCA Analysis

Principal Components Analysis (PCA), as one of the most common dimension reduction methods, can help improve the result of classification. The Neural Network with Principal Component Analysis method described by Ripley [2007] runs principal component analysis on the data first and then uses the components in the neural network model. Each predictor must take more than one value, since the variance of each predictor is used in the PCA; any predictor taking only one value is removed before the analysis. New data for prediction are also transformed with the PCA before being fed to the networks. A pipeline sketch follows Algorithm 5.


Algorithm 5 Neural Network with PCA Analysis

Given data D = {x_1, ..., x_n}, x_i ∈ R^m, find the covariance estimate Σ
for i ← 1 : p do
    Obtain the eigenvalues λ_i and eigenvectors e_i from Σ
    Obtain the principal components y_i = e_i^T X
end for
Get the p-dimensional input vector y = (y_1, y_2, ..., y_p)^T after the PCA
for j ← 1 : q do
    Compute the linear combination h_j(y) = β_0j + β_j^T y for each node in the hidden layer
    Pass h_j(y) through the nonlinear activation function z_j = ψ(β_0j + ∑_{l=1}^{p} β_lj y_l)
end for
Combine the z_j with coefficients to get η(y) = γ_0 + ∑_{j=1}^{q} γ_j ψ(β_0j + ∑_{l=1}^{p} β_lj y_l)
Pass η(y) through another activation function to the output layer: μ_k(y) = φ_k(η(y))
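A pipeline sketch on toy stand-in data; the fitted PCA transform is automatically reapplied to new data before the network, matching the description above, and the layer size and component count are illustrative.

```python
# Sketch: PCA followed by a small neural network.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)
X = rng.random((83, 20))
y = rng.integers(0, 7, size=83)    # toy stand-in labels

model = make_pipeline(StandardScaler(),
                      PCA(n_components=10),
                      MLPClassifier(hidden_layer_sizes=(16,),
                                    max_iter=2000, random_state=0))
model.fit(X, y)
print("training accuracy:", model.score(X, y))
```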

4.5 Penalized Discriminant Analysis

Linear Discriminant Analysis (LDA) is a common tool for classification and dimension reduction. However, LDA can be too flexible in the choice of β when the predictor variables are highly correlated. Hastie et al. [1995] came up with Penalized Discriminant Analysis (PDA) to avoid the overfitting resulting from LDA. Basically, a penalty term is added to the within-class covariance matrix: Σ′_W = Σ_W + Ω. A sketch of a closely related shrinkage approach follows Algorithm 6.

Algorithm 6 Penalized Discriminant Analysis

Given data D = {(x_1, Y_1), ..., (x_n, Y_n)}, x_i ∈ R^q
Compute the within-class covariance matrix Σ_W = ∑_{i=1}^{n} (x_i − μ_{Y_i})(x_i − μ_{Y_i})^T + Ω
Compute the between-class covariance matrix Σ_B = ∑_{j=1}^{m} n_j (x̄_j − μ)(x̄_j − μ)^T
Maximize the ratio of the two matrices: w = argmax_w (w^T Σ_B w) / (w^T Σ_W w)
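scikit-learn has no PDA as such, so the sketch below uses shrinkage LDA, which regularizes the within-class covariance with a ridge-type penalty (Ω = λI); the same idea as Hastie et al. [1995], named plainly as a substitute.

```python
# Sketch: a penalized (shrinkage) variant of LDA on toy stand-in data.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(5)
X = rng.random((83, 20))
y = rng.integers(0, 7, size=83)

pda = LinearDiscriminantAnalysis(solver="lsqr", shrinkage=0.3)
pda.fit(X, y)
print("training accuracy:", pda.score(X, y))
```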

5 Topic Modeling

5.1 Intuition Behind Model

Similar to the work of Blei [2012] in text mining, Figure 7 illustrates the intuition behind our model in a music context. We assume an album, as a collection of songs, is a mixture of different topics (melodies). These topics are distributions over a series of notes (left part of the figure). In each song, the notes in every measure are chosen based on the topic assignments (colorful tokens), while the topic assignments are drawn from the document-topic distribution.


Figure 7. Intuition behind Music Mining

5.2 Model

[Graphical model: α → θ → z → u ← β ← η, with plates L, N, M and K.]

Dirichlet:
p(θ|α) = [Γ(∑_i α_i) / ∏_i Γ(α_i)] ∏_{i=1}^{K} θ_i^{α_i − 1},    p(β|η) = [Γ(∑_i η_i) / ∏_i Γ(η_i)] ∏_{i=1}^{K} β_i^{η_i − 1}    (2)

Multinomial:
p(z_n|θ) = ∏_{i=1}^{K} θ_i^{z_n^i},    p(x_n|z_n, β) = ∏_{i=1}^{K} ∏_{j=1}^{V} β_{ij}^{z_n^i x_n^j}    (3)

Notation

• u: notes (observed)

• z: chord per measure (hidden)

• θ: chord proportions for a song (hidden)

• α: parameter that controls the chord proportions

• β: key profiles

• η: parameter that controls the key profiles


5.3 Generative Process

1. Draw θ ∼ Dirichlet(α)

2. For each harmony k ∈ 1, ...,K

• Draw βk ∼ Dirichlet(η)

3. For each measure un (notes in nth measure) in song m

• Draw harmony zn ∼Multinomial(θ)

• Draw pitch in nth measure xn|zn ∼Multinomial(βk)

Terms for a single song:

p(θ|α) = [Γ(∑_i α_i) / ∏_i Γ(α_i)] ∏_{i=1}^{K} θ_i^{α_i − 1}    (4)

p(β|η) = [Γ(∑_i η_i) / ∏_i Γ(η_i)] ∏_{i=1}^{K} β_i^{η_i − 1}    (5)

p(z_n|θ) = ∏_{i=1}^{K} θ_i^{z_n^i}    (6)

p(x_n|z_n, β) = ∏_{i=1}^{K} ∏_{j=1}^{V} β_{ij}^{z_n^i x_n^j}    (7)

Joint distribution for the whole album:

p(θ, z, x | α, β, η) = ∏_{k=1}^{K} p(β|η) ∏_{m=1}^{M} p(θ|α) ( ∏_{n=1}^{N} p(z_n|θ) p(x_n|z_n, β) )    (8)

Summary

• Assume there are M documents in the corpus.

• The topic distribution under each document is a Multinomial distribution Mult(θ) with its conjugate prior Dir(α).

• The word distribution under each topic is a Multinomial distribution Mult(β) with the conjugate prior Dir(η).

• For the nth word in a certain document, first we select a topic z from the per-document topic distribution Mult(θ), then select a word under this topic, x|z, from the per-topic word distribution Mult(β).

• Repeat for M documents. For M documents, there are M independent Dirichlet-Multinomial distributions; for K topics, there are K independent Dirichlet-Multinomial distributions.


5.4 Estimation

The per-document posterior is

p(β, z, θ | x, α, η) = p(θ, β, z, x | α, η) / p(x | α, η)
                     = [ p(θ|α) ∏_{n=1}^{N} p(z_n|θ) p(x_n|z_n, β_{1:K}) ] / [ ∫_θ p(θ|α) ∏_{n=1}^{N} ∑_{z=1}^{K} p(z_n|θ) p(x_n|z_n, β_{1:K}) dθ ]    (9)

Here we use Variational EM (VEM) instead of the EM algorithm for approximate posterior inference, because the posterior in the E-step is intractable to compute.

Figure 8. Variational EM Graphical Model

Blei et al. [2003] proposed using a variational distribution q(β, z, θ|λ, φ, γ) (Eq. 10) to approximate the posterior p(β, z, θ|x, α, η) (Eq. 11). That is to say, by removing certain connections in the graphical model in Figure 8, we obtain a tractable lower bound on the log-likelihood.

q(β, z, θ | λ, φ, γ) = ∏_{k=1}^{K} Dir(β_k|λ_k) ∏_{d=1}^{M} ( q(θ_d|γ_d) ∏_{n=1}^{N} q(z_{dn}|φ_{dn}) )    (10)

p(β, z, θ | x, α, η) = p(θ, β, z, x | α, η) / p(x | α, η)    (11)

With the simplified version of the posterior distribution, we aim to minimize the KL divergence (Kullback–Leibler divergence) between the variational distribution q(β, z, θ|λ, φ, γ) and the posterior p(β, z, θ|x, α, η) to obtain the optimal values of the variational parameters γ, φ, and λ (Eq. 13); that is, to obtain the maximum lower bound L(γ, φ, λ; α, η) (Eq. 14).

ln p(x|α, η) = L(γ, φ, λ; α, η) + D( q(β, z, θ|λ, φ, γ) || p(β, z, θ|x, α, η) )    (12)

(λ*, φ*, γ*) = argmin_{λ,φ,γ} D( q(β, z, θ|λ, φ, γ) || p(β, z, θ|x, α, η) )    (13)

L(γ, φ, λ; α, η) = E_q[ln p(θ|α)] + E_q[ln p(z|θ)] + E_q[ln p(β|η)] + E_q[ln p(x|z, β)]
                   − E_q[ln q(θ|γ)] − E_q[ln q(z|φ)] − E_q[ln q(β|λ)]    (14)


Algorithm 7 Variational EM for Smoothed LDA in Sheet Music

for t ← 1 : T do
    E-step: fix the model parameters α, η; initialize φ_{ni}^0 := 1/k, γ_i^0 := α_i + N/k, λ_{ij}^0 := η
    for n ← 1 : N do
        for i ← 1 : k do
            φ_{ni}^{t+1} := exp(Ψ(γ_i^t)) ∏_{j=1}^{V} β_{ij}^{x_n^j}
        end for
        Normalize φ_n^{t+1} to sum to 1
    end for
    γ^{t+1} := α + ∑_{n=1}^{N} φ_n^{t+1}
    λ_j^{t+1} := η + ∑_{d=1}^{M} ∑_{n=1}^{N_d} φ_{dn}^{t+1} x_{dn}^j
    M-step: fix the variational parameters γ, φ, λ; maximize the lower bound with respect to the model parameters η, α
end for (repeat until convergence)
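A fitting sketch in which scikit-learn's LatentDirichletAllocation (which implements variational Bayes) stands in for the VEM procedure above; dtm is a toy document-term matrix, not the paper's data.

```python
# Sketch: fitting smoothed LDA by variational inference, as in Algorithm 7.
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.default_rng(6)
dtm = rng.integers(0, 5, size=(83, 200))   # 83 songs x 200 measure-terms

lda = LatentDirichletAllocation(n_components=10,       # K topics (melodies)
                                doc_topic_prior=0.1,   # alpha
                                topic_word_prior=0.1,  # eta
                                random_state=0)
theta = lda.fit_transform(dtm)   # per-song topic proportions (M x K)
beta = lda.components_           # per-topic term weights (K x V)
print(theta.shape, beta.shape)
```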

6 Implementation

In this section we implement the pattern recognition and topic modeling methods with the two representations (note-based and measure-based) demonstrated previously, and evaluate the performance of the different representations in diverse scenarios.

6.1 Pattern Recognition

6.1.1 Note-Based Model

Figure 9. Pattern Recognition on Jazz and Chinese Music


Figure 10. Pattern Recognition on Jazz and Japanese Music

Figure 11. Pattern Recognition on Jazz and Arabic Music

6.1.2 Measure-Based Model

Figure 12. Pattern Recognition on Different Jazz Musicians

6.1.3 Comments and Conclusion

For the note-based model, we can see that the five supervised machine learning techniques could all classify the different music genres with an error rate of no more than 35%. In addition, the performances of random forest, k nearest neighbors, and neural networks with PCA are much better than those of the other two methods. Among the three comparisons (Jazz vs. Chinese music, Jazz vs. Japanese music, Jazz vs. Arabic music), the comparison of Jazz vs. Chinese music gives better results than the other two, with random forest reaching an error rate lower than 0.1. For recognition between Jazz and Chinese songs, random forest is the best, with the lowest error rate and variance. For recognition between Jazz and Japanese songs, k nearest neighbors, neural networks and random forest have comparatively low error rates, but the performance of k nearest neighbors has a smaller variance. For the comparison between Jazz and Arabic songs, neural networks and random forest have comparatively low error rates, while they all have large variances.

For the measure-based model, we can see from the confusion matrix of the training set that the model accuracy is very high for all techniques except k nearest neighbors. However, for the test set all the models fail to provide very good results, the lowest error rate being 0.4 from random forest. It is obvious that this scenario suffers from an overfitting issue. Further investigation is necessary if we want to use this representation.

6.2 Topic Modeling

6.2.1 Perplexity

In topic modeling, the number of topics is crucial for the model to achieve its optimal performance. Perplexity is one way to measure the predictive ability of a probability model. Having the optimal number of topics is always helpful in the sense of reaching the best result with minimum computational time. The perplexity of a corpus D of M documents is computed as in Equation (15).

P(D) = exp( − [ ∑_{d=1}^{M} log p(w_d; λ) ] / [ ∑_{d=1}^{M} N_d ] )    (15)
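A small sketch of comparing topic counts by perplexity, reusing the toy document-term matrix from the previous sketch; in practice the perplexity would be computed on held-out documents.

```python
# Sketch: comparing topic counts by perplexity (lower is better).
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.default_rng(7)
dtm = rng.integers(0, 5, size=(83, 200))

for k in (2, 8, 10, 16, 20):
    lda = LatentDirichletAllocation(n_components=k, random_state=0).fit(dtm)
    print(k, round(lda.perplexity(dtm), 1))
```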

Apart from the above common way, there are many other methods to find the optimal number of topics. The existing ldatuning R package provides 4 methods that calculate all the metrics for selecting the best number of topics for an LDA model at once.

Table 8 shows the 4 different evaluation metrics. The extremum in each scenario indicates the optimal number of topics:

• Minimum

– Arun2010 [Arun et al., 2010]

– CaoJuan2009 [Cao et al., 2009]

• Maximum

– Deveaud2014 [Deveaud et al., 2014]

– Griffiths2004 [Griffiths and Steyvers, 2004]


Table 8. Values of the Different Evaluation Metrics

Number of Topics | Griffiths2004 | CaoJuan2009 | Arun2010  | Deveaud2014
       2         |   -7454.086   | 0.11290217  | 13.856421 | 1.8604276
       4         |   -6821.928   | 0.07120480  |  8.508257 | 1.7877936
       6         |   -6516.431   | 0.06146701  |  5.613616 | 1.7126743
       8         |   -6322.309   | 0.05740186  |  3.728195 | 1.6422201
      10         |   -6184.650   | 0.05336498  |  2.404497 | 1.5998098
      16         |   -6112.754   | 0.06507096  |  1.328469 | 1.3594688
      20         |   -6101.264   | 0.07099931  |  1.512142 | 1.2242214
      26         |   -6129.508   | 0.09352393  |  1.856783 | 1.0760613
      30         |   -6121.120   | 0.10582645  |  2.545512 | 0.9585189
      36         |   -6177.121   | 0.12330036  |  4.078891 | 0.8530592
      40         |   -6183.168   | 0.14128330  |  5.226102 | 0.7767756
      46         |   -6224.206   | 0.15072742  |  5.372056 | 0.7119278
      50         |   -6253.992   | 0.16448002  |  6.637710 | 0.6719547
      60         |   -6352.595   | 0.20606817  |  7.769699 | 0.5844223
      72         |   -6325.653   | 0.25947947  |  9.892807 | 0.4742397
      80         |   -6393.940   | 0.26968788  | 10.187645 | 0.4463054

Figure 13. Evaluating LDA Models

From these metrics we can conclude that the optimal number of topics is around 8–12. In this scenario the Deveaud2014 metric is not as informative as the other three.

6.2.2 Discussion

Figure 14 shows the top 10 tokens in the topics from two scenarios.


For the measure-based scenario, we can see that some topics contain purely natural keys, e.g. Topic 1: [E, O, O, O, O, O, O, O], Topic 5: [B, D, B, B, D, D, E, E], while some topics are very complicated, with many sharps and flats in the notes, e.g. Topic 3: [B♭, A, F, A♭, B♭, B♭, O, O], Topic 6: [F, G, F, E, E♭, B♭, C♯, D].

For the note-based scenario, each token is a 12-dimensional vector indicating which of the pitches are "on" in a certain measure. Some of the topics contain many active notes: e.g. in Topic 2, some tokens have as many as 7 active pitches. Meanwhile, some topics are very quiet, with only a few active notes: e.g. in Topic 4 most pitches are mute, and tokens have at most 3 active pitches.

Figure 14. Top 10 Tokens in Selected Topic in Two Scenarios

Figure 15 shows the per-topic per-word probabilities for the measure-based scenario. We can see that some topics appear very complicated, with most of their terms containing flat or sharp notes (Topic 3, Topic 4). Some topics are very simple (Topic 8). Some topics contain too many terms with the same probability (Topic 2, Topic 4).


Figure 15. Topic Terms Distribution from Measure-Based Scenario

Figure 16 shows the per-topic per-word probabilities for the note-based scenario. Topic 4 and Topic 2 have certain distinctive terms, while the terms in Topic 9 have fairly similar probabilities. Further investigation involving the musicians is needed to better interpret the result.

Figure 16. Topic Terms Distribution from Note-Based Scenario


Lastly, we draw chord diagrams to examine potential relationships between the topics learned from the topic models and the targeted subjects. In Figure 17, we can see:

• American songs (Jazz music in this case) are particularly dominant in Topic 9, whose most probable term is [1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1]. It can also be interpreted as the pitch-class set {C, E, G, A, B}.

• Arabic songs contribute mostly to Topic 3, which has various terms equally distributed (see Figure 16).

• Most of the Chinese songs attribute to Topic 4 and Topic 5, which contain the most probable notes of the G major or E minor scale, {E, F♯, B}.

• Japanese songs seem to have similar contributions to every topic.

In Figure 18, we can see:

• The musicians John Coltrane, Sonny Rollins and Louis Armstrong show certain preferences towards certain topics.

• Other musicians do not show clear bias to a specific topic.

Figure 17. Chord Diagram for Music Genres


Figure 18. Chord Diagram for Jazz Music

7 Conclusion

7.1 Summary

In this paper we create two different representations for symbolic music and transform the music notes from music sheets into matrices for statistical analysis and data mining. Specifically, each song can be regarded as a text body consisting of different musical words. One way to represent these musical words is to segment the song into several parts based on the duration of each measure; the words in each song then turn out to be a series of notes in one measure. Another way to represent musical words is to restructure the notes in each segment based on the fixed 12-dimensional pitch class. Both representations have been employed in pattern recognition and topic modeling techniques respectively, to detect music genres based on the collected songs and figure out the potential connections between musicians and latent topics.

The predictive performance in pattern recognition for the note-based representation turns out to be very good, with an 88% accuracy rate in the optimal scenario. We explored several aspects among music genres and musicians to see the hidden associations between different elements. Some genres contain very strong characteristics which make them very easy to detect. Jazz musicians John Coltrane, Sonny Rollins and Louis Armstrong show a particular preference towards certain topics. All these features are employed in the model to help better understand the world of music.


7.2 Future Work

Music mining is a giant research field, and what we have done is merely the tip of the iceberg. Looking back to the initial motivation that triggered us to embark on this research work: why does music from diverse cultures have such a powerful inherent capacity to bring people so many different feelings and emotions? How to ultimately replace human intelligence with statistical algorithms for melody interpretation still remains to be discovered.

Several potential studies we would love to continue exploring in the foreseeable future:

• Facilitate transformation between audio music and symbolic music via machine learning techniques.

• Deepen the understanding of the musical lexicon and grammatical structure and create the dictionary in a mathematical way.

• How to derive representations for smooth recognition of Jazz by statistical learning methods?

• Apart from notes, can we embed other inherent musical structures, such as cadence and tempo, to better interpret the musical words?

• Explore improvisation key learning (how many keys the giants of jazz tended to play in, and what those keys are).

• Explore musical harmonies and their connection with elements of mood.

Acknowledgments

We would like to show our gratitude to Dr. Jonathan Kruger, Dr. Evans Gouno, Mrs. Rebecca Ann Finnangan Kemp, and Dr. David Guidice for sharing their pearls of wisdom with us during the personal communications on the music lexicon.

Special thanks go to the musicians: Lizhu Lu from the Eastman School of Music, Gankun Zhang from the Brandon University School of Music, Dr. Carl Atkins from the Department of Performance Arts & Visual Culture, and Professor Kwaku Kwaakye Obeng from Brown University, for their encouragement and technical support in music theory.

Qiuyi Wu thanks the RIT Research & Creativity Reimbursement Program for partially sponsoring this work so that it could be presented at the Joint Statistical Meetings (JSM) this year in Vancouver. She appreciates the support from the International Conference on Advances in Interdisciplinary Statistics and Combinatorics (AISC) for the NC Young Researcher Award this year. She thanks the 7th Annual Conference of the Upstate New York Chapters of the American Statistical Association (UP-STAT) for recognizing this work and awarding her the Gold Medal for Best Student Research Award this year.

References

Rajkumar Arun, Venkatasubramaniyan Suresh, CE Veni Madhavan, and MN Narasimha Murthy. On finding the natural number of topics with latent dirichlet allocation: Some observations. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 391–402. Springer, 2010.

David M Blei. Probabilistic topic models. Communications of the ACM, 55(4):77–84, 2012.

David M Blei, Andrew Y Ng, and Michael I Jordan. Latent dirichlet allocation. Journal of Machine Learning Research, 3(Jan):993–1022, 2003.


Juan Cao, Tian Xia, Jintao Li, Yongdong Zhang, and Sheng Tang. A density-based method for adaptive LDA model selection. Neurocomputing, 72(7-9):1775–1781, 2009.

Scott Deerwester, Susan T Dumais, George W Furnas, Thomas K Landauer, and Richard Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391–407, 1990.

Dharma Deva. Underlying socio-cultural aspects and aesthetic principles that determine musical theory and practice in the musical traditions of China and Japan. Renaissance Artists and Writers Association, 1999.

Romain Deveaud, Eric SanJuan, and Patrice Bellot. Accurate and effective latent concept modeling for ad hoc information retrieval. Document numérique, 17(1):61–84, 2014.

Luc Devroye, László Györfi, and Gábor Lugosi. A Probabilistic Theory of Pattern Recognition, volume 31. Springer Science & Business Media, 2013.

Tuomas Eerola and Petri Toiviainen. MIDI Toolbox: MATLAB tools for music research. 2004.

Evans Gouno. Personal communication.

Thomas L Griffiths and Mark Steyvers. Finding scientific topics. Proceedings of the National Academy of Sciences, 101(suppl 1):5228–5235, 2004.

Trevor Hastie, Andreas Buja, and Robert Tibshirani. Penalized discriminant analysis. The Annals of Statistics, pages 73–102, 1995.

Thomas Hofmann. Probabilistic latent semantic analysis. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, pages 289–296. Morgan Kaufmann Publishers Inc., 1999.

Diane J Hu. Latent dirichlet allocation for text, images, and music. University of California, San Diego. Retrieved April 26, 2013, 2009.

Diane J. Hu and Lawrence K. Saul. A probabilistic topic model for unsupervised learning of musical key-profiles, 2009a.

Diane J Hu and Lawrence K Saul. A probabilistic topic model for music analysis. In Proc. of NIPS, volume 9. Citeseer, 2009b.

Rebecca Ann Finnangan Kemp. Personal communication.

Jonathan Kruger. Personal communication.

Carol L. Krumhansl and Edward J. Kessler. Tracing the dynamic changes in perceived tonal organization in a spatial representation of musical keys. Psychological Review, 89(4):334–368, 1982. doi: 10.1037//0033-295x.89.4.334.

Carol L Krumhansl and Mark Schmuckler. A key-finding algorithm based on tonal hierarchies. Cognitive Foundations of Musical Pitch, pages 77–110, 1990.

Yann Le Cun, Ofer Matan, Bernhard Boser, John S Denker, Don Henderson, Richard E Howard, Wayne Hubbard, LD Jackel, and Henry S Baird. Handwritten zip code recognition with multilayer networks. In Proceedings of the 10th International Conference on Pattern Recognition, volume 2, pages 35–40. IEEE, 1990.

H Christopher Longuet-Higgins and Mark J Steedman. On interpreting Bach. Machine Intelligence, 6:221–241, 1971.

Jon D Mcauliffe and David M Blei. Supervised topic models. In Advances in Neural Information Processing Systems, pages 121–128, 2008.


Brian D Ripley. Pattern Recognition and Neural Networks. Cambridge University Press, 2007.

Julia Silge. The game is afoot! Topic modeling of Sherlock Holmes stories, 2018.

David Temperley et al. Music and Probability. MIT Press, 2007.

P. Toiviainen and T. Eerola. MIDI toolbox 1.1. https://github.com/miditoolbox/, 2016.
