


A Probabilistic Context-Free Grammar for Melodic Reduction*

Édouard Gilbert¹ and Darrell Conklin²

¹ Department of Computer Science and Telecommunications, École Normale Supérieure de Cachan, Brittany, France. [email protected]

² Department of Computing, City University, London, United Kingdom

[email protected]

Abstract. This article presents a method used to find tree structures in musical scores using a probabilistic grammar for melodic reduction. A parsing algorithm is used to find the optimal parse of a piece with respect to the grammar. The method is applied to parse phrases from Bach chorale melodies. The statistical model of music defined by the grammar is also used to evaluate the entropy of the studied pieces and thus to estimate a possible information compression figure for scores.

Introduction

Music has a high-level structure and can be considered as much more than just a sequence of events. The field of Natural Language Processing has developed many tools to find structures in written sentences. This article explores the use of one of these tools (the probabilistic context-free grammar) on musical scores, more specifically on Bach chorale melodies.

These grammars are to context-free grammars what hidden Markov models are to regular grammars. Like HMMs, they define probabilistic language models and can be trained on a corpus or inferred from an already parsed corpus, but in addition they define a more powerful language class. Furthermore, the existence of a polynomial-time algorithm ensures reasonable computation time for parsing.

While some other proposals for melodic reduction exist, such as the one developed by Lerdahl and Jackendoff [1], the grammar developed here is more precisely defined at a micro-structural level and uses a probabilistic means to deal with ambiguity. Furthermore, the method captures hierarchical relationships among pitches, like Schenkerian analysis [2]. However, the method does not necessarily reduce different scores to the same background level.

* Please cite as "Gilbert, É. and Conklin, D.: A Probabilistic Context-Free Grammar for Melodic Reduction. In Proceedings of the International Workshop on Artificial Intelligence and Music, 20th International Joint Conference on Artificial Intelligence (IJCAI), Hyderabad, India. (2007) 83–94"


The aim of the method is not necessarily to find the parse that a typical listener may assign to a melody, but rather to define a statistical model of music which is compact, computable, and potentially more powerful than finite-state (Markov) approaches. As a consequence, no psychological preference rules are explicitly included in the grammar. It should also be noted that the method does not take tonality into account but rather defines melodic reduction purely in terms of melodic and metrical intervals.

1 Methods

The aim of this work is to compute an optimal parse tree for given melodies. The very first step was thus to define a grammar. In this section the (non-probabilistic) grammar is defined, then its simple extension to a probabilistic context-free grammar (PCFG) is developed.

1.1 A context-free grammar

A grammar of melody can simplify or reduce a melody by separating melodic elaboration notes from the others. The reduction may be applied recursively, meaning that the top level of the reduction contains only a compact background structure.

The rules used in this study are widely used by composers. Their names match the idea behind them: a repeat rule to split a note in two, a passing rule to add a note in order to fill a gap, and a neighbour rule which develops a note by adding another one whose pitch is close. Finally, an escape rule is used when a tone escapes by going slightly in the opposite direction. In order to be able to parse any score, a new rule is also added; it allows any note to appear, but should not be used inside a parse tree, i.e. it should be used only with the first note of the piece or with a note added using a new rule. It is therefore used to merge elaboration trees built using only the other rules, and is present only on the highest background level.

One example of each rule is depicted in Figure 1 (for brevity, the inverses of the rules, for example a lower neighbour tone rule, are not presented in the figure). Note that the names may not match what a musician would expect. For example, passing tones are defined exclusively with respect to intervals, and do not take tonality into account.

Some elaborations would not be context-free with the direct use of pitches, as can be seen in the last three examples of Figure 1. These rules were therefore modified so that they were defined based on intervals instead. This ensures that a rule such as the passing tone rule is context-free. If one were to use pitches, such a rule would rewrite two pitches (e.g., a C and an E) as three (a C, a D and an E). Using intervals, any rule in the context-dependent form

note1 note2 −→ note1 note′ note2


[Musical notation omitted; the figure showed one example of each rule: New (nw), Repeat (rp), Neighbour (nb), Passing (ps), Escape (es).]

Fig. 1. Examples of every rule used in the grammar defined in this article.

can be rewritten into

interval(note1, note2) −→ interval(note1, note′) interval(note′, note2)

which is context-free. Metric levels, which state whether a beat of the measure is strong or weak, were also taken into account. Metric levels are defined as follows: for duple meters, the first beat of every measure is given a metric level of 1. The beats bisecting these are then given a metric level of 2, the beats bisecting those are given a metric level of 3, and so on. For triple meters, the principle is the same except that the second and third beats of the measure are both given a metric level of 2.
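The bisection scheme above can be sketched in code. The following is an illustrative sketch, not the paper's implementation: the function name, the beat-based position convention, and the level cap are assumptions for the example.

```python
from fractions import Fraction

def metric_level(pos, measure, triple=False, max_level=6):
    """Assign a metric level to an onset position within a measure,
    following the bisection scheme described above. Positions and the
    measure length are in beats; the downbeat (position 0) gets level 1."""
    pos = Fraction(pos) % Fraction(measure)
    if pos == 0:
        return 1
    if triple:
        # Second and third beats of a triple meter both get level 2;
        # bisection then continues below the beat.
        span, level = Fraction(measure, 3), 2
        if pos % span == 0:
            return 2
    else:
        span, level = Fraction(measure), 1
    while level < max_level:
        # Beats bisecting spans of level k are given level k + 1.
        span /= 2
        level += 1
        if pos % span == 0:
            return level
    return max_level
```

The metric ∆ between two successive notes, used later in the grammar, would then simply be the difference of their levels.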

In our grammar, metric levels were used to forbid the use of several rules. The neighbour and repeat rules are forbidden if the beat of the added pitch is stronger than those of the outer pitches. For the other rules, metric information also allows accented and unaccented versions of the rules (when the added note's metric level is lower or higher than that of the first note) to have different probabilities. Again, differences between the metric levels of a note and the following one (referred to in the following as the metric ∆) were used instead of metric levels themselves, to ensure that the rules are context-free.

One of the main features of this grammar is that it is ambiguous. While sometimes the ambiguity is of no consequence to the final background structure, as in Figure 2, it can also lead to very different parse trees, as seen in Figure 3.

In order to choose between the different possible parses, probabilities wereadded to the different rules. The grammar defined is thus a probabilistic context-free grammar or PCFG.

1.2 Probabilistic context-free grammars

Probabilistic context-free grammars were developed mainly for natural language processing [3], where ambiguity is extensive.

As stated earlier, a probabilistic context-free grammar (PCFG) is a context-free grammar modified to give each rule a probability of being used in a parse tree, similar to the way that hidden Markov models are regular grammars with probabilities included.

Formally, a probabilistic context-free grammar consists of:


[Musical notation and tree diagrams omitted; the figure comprised the following panels.]

(a) The score to parse.
(b) One possible parse.
(c) Another possible parse.
(d) The score without any elaborations (parse 2(b)).
(e) The score without any elaborations (parse 2(c)).

Fig. 2. An ambiguity of no consequence. Once all elaborations are removed, the background structures of the scores are the same. The numbers in the trees and scores are the intervals between notes in semitones.

[Musical notation and tree diagrams omitted; the figure comprised the following panels.]

(a) The score to parse.
(b) One possible parse.
(c) Another possible parse.
(d) The score without any elaborations (parse 3(b)).
(e) The score without any elaborations (parse 3(c)).

Fig. 3. A real ambiguity. Here, once all elaborations are removed, the background structures differ. The numbers are the intervals between notes in semitones.


– A set W of terminals w1 . . . wV (often referred to as words)
– A set N of non-terminals N1 . . . Nn
– Among these, a start symbol N1, here called the root
– A set of rules {N i → ζ j} where ζ j ∈ (W ∪ N)∗
– A corresponding set of probabilities such that for all N i: Σj P(N i → ζ j | N i) = 1

The probability of a particular parse tree is the product of the probabilities of every rule which appears in the parse. The probability of a sentence can then be defined as the sum over all its possible parse trees. This means a PCFG defines a probability distribution over all the possible sentences formed by a succession of terminals, possibly zero for sentences not in the language accepted by the grammar.
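The first of these two definitions can be illustrated directly. The sketch below is not the authors' code; the nested-tuple tree representation and the rule probabilities are invented for illustration.

```python
# A parse tree is a nested tuple (rule_name, child, ...); leaves are
# terminals. The probability of a tree is the product of the
# probabilities of its rule occurrences.
RULE_PROB = {"nw": 0.2, "rp": 0.3, "nb": 0.2, "ps": 0.2, "es": 0.1}

def tree_probability(tree):
    if not isinstance(tree, tuple):   # a terminal contributes no rule
        return 1.0
    rule, *children = tree
    p = RULE_PROB[rule]
    for child in children:
        p *= tree_probability(child)
    return p
```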

This leads to the advantages of using a PCFG. First, they can deal with ambiguity: more probable parses can be preferred to less probable ones. Second, they can be learnt easily. While a deterministic context-free grammar cannot be learnt without negative examples [4], a probabilistic grammar can, for it already takes the likelihood of the sentences into account. Finally, there is a polynomial-time algorithm for computing the parses.
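The polynomial-time parsing alluded to here can be sketched as a Viterbi-CKY pass over a toy PCFG in Chomsky normal form. The grammar, its probabilities, and the function name below are invented for illustration; the paper's grammar is richer than this toy.

```python
from collections import defaultdict

BINARY = {("S", ("A", "B")): 0.9,    # S -> A B
          ("A", ("A", "A")): 0.4}    # A -> A A
UNARY = {("A", "a"): 0.6,            # A -> 'a'
         ("B", "b"): 1.0}            # B -> 'b'

def viterbi_parse(words, start="S"):
    """Return the probability of the best parse of `words` (0.0 if none)."""
    n = len(words)
    best = defaultdict(float)                 # (i, j, symbol) -> best prob
    for i, w in enumerate(words):             # fill in length-1 spans
        for (sym, term), p in UNARY.items():
            if term == w and p > best[(i, i + 1, sym)]:
                best[(i, i + 1, sym)] = p
    for span in range(2, n + 1):              # combine shorter spans
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for (a, (b, c)), p in BINARY.items():
                    q = p * best[(i, k, b)] * best[(k, j, c)]
                    if q > best[(i, j, a)]:
                        best[(i, j, a)] = q
    return best[(0, n, start)]
```

Replacing `max` by a sum over analyses would instead give the total sentence probability used for the entropy estimates later in the article.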

1.3 A lexicalized grammar

In the case of the grammar of this article, the terminals were all possible intervals (in fact, intervals were considered between −24 and 24) associated with every possible metric ∆ (here, between −4 and 4). This leads to 49 × 9 = 441 different pairs. In practice, a range of 31 intervals is sufficient to contain all the intervals in the corpus, and a range of only 8 metric ∆. The total count of interval/metric ∆ pairs considered (referred to in the following as i/m∆ pairs) is thus 31 × 8 = 248.
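The counting above can be checked mechanically; this fragment only verifies the arithmetic of the text, and the names FULL and USED are ours, not the paper's.

```python
from itertools import product

# The full terminal alphabet is the cross product of intervals and
# metric deltas; the restricted ranges (31 intervals, 8 metric deltas)
# give the 248 pairs actually used.
FULL = set(product(range(-24, 25), range(-4, 5)))
assert len(FULL) == 49 * 9 == 441
USED = 31 * 8   # = 248 i/m-delta pairs
```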

As the new rule is meant to add notes which are not part of an elaboration and to allow any score to be parsed, it is allowed to appear only on a spine, which consists of the root and every note added by another new rule. On a drawn tree, those nodes would form the diagonal on the upper right. In order to make that distinction appear in the grammar, two non-terminals were used. The first one, S, is used for the root and on the spine; the other one, I, is used everywhere else. As the non-terminals are different, a neighbour rule splitting an interval of the spine will a priori not have the same probabilities as the same rule applied elsewhere.

The values of the i/m∆ pairs are used to give different i/m∆ pairs different probabilities. This is done by considering each non-terminal to be a couple composed of a "true" non-terminal (here, S or I) and a terminal (here, an i/m∆ pair); this is called lexicalisation. The modified grammar is said to be lexicalized.

The grammar is thus defined as follows:

– The terminals: W = {−24, . . . , 24} × {−4, . . . , 4}
– The non-terminals: N = {S, I}
– The start symbol S.


The rules are described in Figure 4. Please note that the rules are in fact rule schemata, allowing a concise representation of the grammar. The last rule schema is used in order to transform a non-terminal into a terminal.

New (nw): S[N] −→ I[N1] S[N2]
Repeat (rp): S/I[N] −→ I[N] I[0]
Neighbour (nb): S/I[0] −→ I[N1] I[N2], with N1 = −N2, |N1| ≤ n1
Passing (ps): S/I[N] −→ I[N1] I[N2], with N1 + N2 = N, N1N2 > 0 and |N| ≤ n2
Escape (es): S/I[N] −→ I[N1] I[N2], with N1 + N2 = N, N1N2 < 0 and |N1| ≤ n3
Replace by a terminal: S/I[N] −→ N

Fig. 4. The structure of the PCFG used in this article. Read S/I as either of the non-terminals S and I. Every line in the grammar is in fact a rule schema. Please note that the metric levels are not stated here.
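The schemata of Figure 4 can be read as generators of interval splits. The sketch below is a hypothetical rendering: the bound values for n1, n2 and n3 are illustrative guesses, not the paper's settings, and metric-level conditions are omitted, as in the figure.

```python
N1_MAX, N_MAX, ESC_MAX = 2, 4, 2   # hypothetical values for n1, n2, n3

def elaborations(n):
    """Yield (rule, N1, N2) with N1 + N2 = n for each applicable schema."""
    yield ("rp", n, 0)                       # repeat: N -> N 0
    if n == 0:                               # neighbour: 0 -> N1 -N1
        for n1 in range(1, N1_MAX + 1):
            yield ("nb", n1, -n1)
            yield ("nb", -n1, n1)
    if n != 0 and abs(n) <= N_MAX:           # passing: two same-sign steps
        s = 1 if n > 0 else -1
        for k in range(1, abs(n)):
            yield ("ps", s * k, n - s * k)
    for n1 in range(-ESC_MAX, ESC_MAX + 1):  # escape: opposite-sign steps
        if n1 != 0 and n1 * (n - n1) < 0:
            yield ("es", n1, n - n1)
```

Enumerating the splits this way also makes the ambiguity of the grammar concrete: several rules can expand the same interval.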

1.4 Training the grammar

PCFGs can be trained in order to improve the probability of the pieces from a specific training corpus. A polynomial-time expectation-maximisation algorithm, similar to the Baum-Welch algorithm in the case of hidden Markov models, was implemented. This algorithm has the drawback that some rules can receive a zero probability. It can thus happen that some pieces which are not in the training corpus cannot be parsed after training, while they could be before. It should also be noted that this algorithm does not compute the optimal grammar, but only reaches a local maximum of the probability of the corpus.

This situation was never encountered when using intervals only, but occurred several times when working on i/m∆ pairs. To cope with that problem, a smoothing function was written to ensure that the new rule can always be used.
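What such a smoothing function might look like can only be guessed from the text. The sketch below mixes each trained rule distribution for one left-hand side with a small floor so that no rule (in particular the new rule) is left with zero probability; the function name and the value of EPS are assumptions.

```python
EPS = 1e-3   # hypothetical smoothing constant

def smooth(rule_probs):
    """rule_probs: dict rule -> probability for one left-hand side,
    summing to 1; returns a floored distribution that still sums to 1."""
    n = len(rule_probs)
    return {r: (1.0 - EPS * n) * p + EPS for r, p in rule_probs.items()}
```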

2 Results

The grammar used here was trained and tested on a corpus of about 1350 phrases of the first soprano voices from 185 Bach chorales (from bwv 253 to bwv 348). 1200 of these phrases were used to train the grammar and about 150 to test it. The phrase boundaries were determined from the pauses in the scores. The training and test corpora were selected at random from the complete corpus.


2.1 Parses and representation

Since the grammar produces binary trees, and uses intervals and metric ∆ as nodes, reading them can be quite difficult. A printing convention was therefore developed in order to present parses in a readable musical notation.

As a first step to make a tree more readable, it is flattened such that every occurrence of the new rule is on the highest layer of the tree. This transformation is possible because, as the grammar is written, all of these occurrences are in fact always on the right-most side of every layer of the tree. Flattening the tree from Figure 5(a) gives the tree represented in Figure 5(b). The rules used are also annotated on the tree.

This flattened tree is then engraved into a score representation using Lilypond (www.lilypond.org). Every layer of the tree is given a specific staff in order to show which notes remain after how many melodic reductions. Every layer thus matches a certain depth of elaboration.

Figure 6 presents different parses of the same chorale (bwv 270, phrase 1). When ignoring metric levels and without training, a very unlikely neighbour tone on a strong beat (the E) is considered as an elaboration. After training, or when taking metric levels into account, this reduction is no longer applied, and one can see a descending triad appearing on the last layer of the parse.

The parses of the sixth phrase of bwv 268, "Auf, auf, mein Herz", which can be seen in Figure 7, are interesting too. While in the second parse a new rule is used where one would expect a repeat rule, the parse also avoids several neighbour tones and prefers the use of passing tones, making triads appear.

2.2 Compression

One objective measure of performance of a statistical model of music is the information compression it can provide to pieces not used in training. This information compression can be measured in terms of the average number of bits required to encode an event.

As stated in Section 1.3, there are 248 different possible i/m∆ pairs. Every pair thus requires log2 248 = 7.95 bits to encode using a null model. The probability of every parse tree in the test corpus was computed in order to get the probability of every event. Using only intervals, the compression achieved on the test corpus by the grammar, without taking metric levels into account, was 2.57 bits per interval, going down from log2 31 = 4.95 bits to 2.67 bits used to encode every interval. Conklin and Witten [5] achieved better results, reaching an entropy of 1.87 bits per pitch. Note, however, that they used pitches and not intervals, thereby retaining slightly more information, so that our results here are not yet directly comparable.
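The bits-per-event figures above follow from a simple computation: the average number of bits is the negative base-2 logarithm of the model's probability for a piece, divided by the number of events. The sketch below illustrates this; the probability value used in the example is invented.

```python
import math

def bits_per_event(piece_probability, n_events):
    """Average number of bits needed to encode one event of a piece,
    given the probability the model assigns to the whole piece."""
    return -math.log2(piece_probability) / n_events

# Null-model baseline for the 248 i/m-delta pairs: log2(248), about 7.95.
null_model_bits = math.log2(248)
```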

When taking metric levels into account, the use of a smoothing function is necessary to ensure that every piece in the test corpus can be parsed. This function slightly modifies the grammar and ensures that the new rule can always be applied. Using that function, a compression level of 3.98 bits per i/m∆ pair was obtained, going down from 7.95 bits to lower than 4 bits per pair. However, it


[Tree diagrams and musical notation omitted; the figure comprised the following panels.]

(a) The parse tree of bwv 285, phrase 4.
(b) The same tree flattened in order to have all the new intervals on the highest level, with the rules used annotated.
(c) The score representation of the same tree.

Fig. 5. A parse tree and the corresponding flattened tree and score representation. The piece parsed here is the fourth phrase of Bach chorale "Da der Herr Christ zu Tische sass", bwv 285. Every note appearing on the lowest layer matches a new rule, except for the second note under the slur, which matches a repeat rule. The first note which disappears between the two last layers is the consequence of a passing rule and the second one of a neighbour rule.


(a) The parse obtained when using the grammar which does not take metric levels into account, before training.

(b) The parse obtained when using the same grammar after 10 training iterations, or using the grammar with metric levels (both before and after 10 training iterations).

(c) The parse proposed by Lerdahl and Jackendoff in [1] for the same score.

Fig. 6. Three different parses of the first phrase of Bach chorale "Befiehl du deine Wege", bwv 270.


(a) The parse obtained when using the grammar with metric levels, before training.

(b) The parse obtained when using the grammar with metric levels, this time after 10 training iterations.

Fig. 7. Two different parses of the sixth phrase of Bach chorale "Auf, auf, mein Herz", bwv 268. The second one does not reduce what one would expect to be a repetition, but rather makes two triads appear by using passing tones instead of neighbour tones on strong beats.


seems likely that using a bigger training corpus would make the need for the smoothing function less important (33 pieces of the test corpus need it to be parsed) and would probably improve the compression.

3 Discussion and conclusion

The methods used in this article to parse scores and those of Lerdahl and Jackendoff [1] are quite different, although the way they represent the trees (using several layers) is similar. The first main difference is that Lerdahl and Jackendoff's method proceeds in several steps. These steps can have an influence on each other, mainly in order to take the surrounding notes into account. As the current method uses intervals and a context-free grammar, there are no such complex influences. The other main difference is the order in which the different rules are applied. Lerdahl and Jackendoff's method applies the different possible rules in a given order; some rules are therefore given precedence over others. In the case of the method described here, some rules are also preferred to others, but not in a given order. Only the probability of a rule, derived through training on a corpus, makes it a frequent or a rare one.

Several improvements to this work are possible. Some of the parses obtained seem strange, as some obvious melodic reductions can still be applied on the layer of new notes, as in Figure 5(c) or 7(b). This might be due to the choice of a training algorithm which considers every possible parse. As the new rule can always be applied, its probability is likely to be overestimated. Using only the most probable parse of every score for training, as in a Viterbi-like algorithm, may solve these problems. Some additional rules, such as triadic rules, might also be useful for some parses, such as the ones in Figure 6(b) or 7(b).

More generally, the method itself has flaws: the reliance on a particular set of reduction rules could be questioned, for one could think of other intuitive rules that might lead to quite different parses. The complexity of the parsing algorithm, even if polynomial, could also prevent the parsing of very long pieces. Moreover, training a grammar without the use of a treebank is known, in the case of natural language processing, to lead to results which depend strongly on the starting grammar. This instability still has to be evaluated in the case studied here.

In another application of natural language processing, Bod [6] suggests the use of a treebank, i.e. a corpus of already parsed sentences, to train a grammar whose aim is to detect phrase boundaries. It would be interesting to see the results of a treebank grammar of melodic reductions, but to our knowledge, there is no such corpus.

The ambiguity of the grammar when using metric ∆ should be evaluated in some way. It might appear that ambiguity seldom arises when combining melodic intervals with metric ∆. It would also be interesting to see the influence of the corpus on the grammar. This work is preliminary; the choice of Bach chorales thus allowed us to work on a stylistically restricted type of melody. But would other


styles or composers define different probabilities for production rules, or would they be very similar?

Using different features could also bring new results. For example, the contour of metric levels (i.e. whether the metric level increases, decreases or remains equal) could be used instead of the metric ∆.

Another interesting point to explore would be to look for patterns, either on the different layers of the trees obtained, or even tree patterns within optimal parse trees.

References

1. Lerdahl, F., Jackendoff, R.: A Generative Theory of Tonal Music. The MIT Press (1983)
2. Forte, A., Gilbert, S.E.: Introduction to Schenkerian Analysis. W. W. Norton (1982)
3. Charniak, E.: Statistical techniques for natural language parsing. AI Magazine 18(4) (1997) 33–44
4. Gold, E.M.: Language identification in the limit. Information and Control 10(5) (1967) 447–474
5. Conklin, D., Witten, I.H.: Multiple viewpoint systems for music prediction. Journal of New Music Research 24(1) (1995) 51–73
6. Bod, R.: Probabilistic grammars for music. In: Belgian-Dutch Conference on Artificial Intelligence (BNAIC). (2001)
7. Chemillier, M.: Toward a formal study of jazz chord sequences generated by Steedman's grammar. Soft Computing 8(9) (2004) 617–622
8. Deutsch, D., Feroe, J.: The internal representation of pitch sequences in tonal music. Psychological Review 88 (1981) 503–522
9. Marsden, A.: Representing melodic patterns as networks of elaborations. Computers and the Humanities 35 (2001) 37–54
10. Charniak, E.: Statistical Language Learning. MIT Press, Cambridge, MA, USA (1994)
11. Manning, C.D., Schütze, H.: Foundations of Statistical Natural Language Processing. The MIT Press, Cambridge, Massachusetts (1999)
12. Lewin, D.: Generalized Musical Intervals and Transformations. Yale University Press (1987)
