
MULTI-PART PATTERN ANALYSIS: COMBINING STRUCTURE ANALYSIS AND SOURCE SEPARATION
TO DISCOVER INTRA-PART REPEATED SEQUENCES

Jordan B. L. Smith    Masataka Goto
National Institute of Advanced Industrial Science and Technology (AIST), Japan
[email protected], [email protected]

ABSTRACT

Structure is usually estimated as a single-level phenomenon with full-texture repeats and homogeneous sections. However, structure is actually multi-dimensional: in a typical piece of music, individual instrument parts can repeat themselves in independent ways, and sections can be homogeneous with respect to several parts or only one part. We propose a novel MIR task, multi-part pattern analysis, that requires the discovery of repeated patterns within instrument parts. To discover repeated patterns in individual voices, we propose an algorithm that applies source separation and then tailors the structure analysis to each estimated source, using a novel technique to resolve transitivity errors. Creating ground truth for this task by hand would be infeasible for a large corpus, so we generate a synthetic corpus from MIDI files. We synthesize audio and produce measure-by-measure descriptions of which instruments are active and which repeat themselves exactly. Lastly, we present a set of appropriate evaluation metrics, and use them to compare our approach to a set of baselines.

1. INTRODUCTION

Music structure is important to listeners and researchers, but annotating music is hard because typical songs include multiple independent instrument parts. For example, if two sections share the same basic melody, but one features an extra horn part, should one section be labeled as a repetition of the other? To decide, the annotator must consider all the ways in which the two sections are similar or different, but the outcome of their decision is encoded in a single bit: whether the label is the same or not. The annotation discards many of the decisions made by the listener, especially when these are made at the timescale of entire sections. For example, the second verse of Oasis' "Wonderwall" has the same chords and melody as the first, but different lyrics, and it includes two new instruments, cello and drums—the latter of which enters a measure late. A single large-scale section label cannot encode this interesting situation.

© Jordan B. L. Smith, Masataka Goto. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Jordan B. L. Smith, Masataka Goto. "Multi-part pattern analysis: Combining structure analysis and source separation to discover intra-part repeated sequences", 18th International Society for Music Information Retrieval Conference, Suzhou, China, 2017.

Figure 1. Example multi-part pattern description for the first 40 measures of "Come Together". Measures that repeat are given the same letter label. In this and later figures, the colors highlight repeated sequences instead of individual labels: if label i is always followed by j, and j always follows i, their color assignments are merged.

The multi-dimensional nature of structure has been commented on [22], and recent corpora of annotations have addressed it in different ways: the SALAMI dataset provides descriptions at two timescales, and of functions and leading instrument [29], and the INRIA dataset describes how segments and their component patterns are hierarchically related [2]. For music cognition research, [26] suggested that music be annotated multiple times on a per-feature basis: e.g., once while focusing only on harmony, again while focusing on timbre, and so forth. However, the challenge of hierarchy is different from the challenge of multiple independent parts. We argue that estimating the structure of these independent parts—i.e., creating a multi-part pattern analysis—should be a new MIR goal.

An example of a multi-part pattern description is shown in Fig. 1. It is derived from a MIDI transcription of "Come Together" by The Beatles from the Lakh dataset [23]. It indicates whenever an instrument in the mixture repeats itself, at the timescale of measures. This representation makes clear that the organ part (here substituting the lead vocals) is varied in the second verse, while the other instruments repeat themselves exactly. Compared to a one-dimensional structural analysis, the richer detail of a multi-part description would be more suitable for applications like automatically editing videos or choreographies to match an audio file.

We make four main contributions in this work. First, we define a new goal for MIR research; second, we propose an algorithm for accomplishing it, which uses existing technology and some new techniques; third, we propose an evaluation framework for the task, including metrics, baselines, and how to obtain ground truth; and finally, we conduct an evaluation.

In the next section, we discuss how our proposed task relates to existing MIR tasks. We present our algorithm in Section 3, present the evaluation framework in Section 4, and discuss the results in Section 5.

2. RELATED WORK

Identifying repeating motives has long been of interest to musicologists and in MIR. Although most research in this area has focused on symbolic data analysis (see, e.g., [5]), when "Discovery of Repeated Themes" was added to MIREX in 2013, it included both symbolic and audio tracks (e.g., [21]). But the focus of that task is different: in it, the challenge is precisely to ignore the differences between instruments (if the piece being analyzed contains multiple parts) as well as, potentially, to ignore differences in key or modality. Our task, multi-part pattern analysis, involves a separate challenge: discovering repetitions expressed by a single voice within the mixture.

Since it involves describing the independent patterns in a mixture of tracks, the task is clearly related to source separation. Recently, approaches to source separation have become more structural, taking better advantage of the redundancies offered by repetition in music. One common technique, non-negative matrix factorization (NMF), separates sources by modeling steady states in the spectrum; an extended version, NMF deconvolution (NMFD), models short sequences that are time-varying but exactly repeating [28], and NMF was recently used to detect long loops [15]. Median filtering, which was used to efficiently perform harmonic-percussive source separation (HPSS) [6], was used in the REPET algorithm to separate a repeating background from a mixture [14]; REPET was later adapted to looping backgrounds that change over time and heterogeneous backgrounds [25]. Although estimating a multi-part pattern analysis will require source separation, the desired output is an abstract description, not a set of separated tracks. Thus, whereas a source separation is evaluated with signal reconstruction error, a pattern analysis will be evaluated more like a structural analysis.

As for structural analysis, it has evolved toward modeling hierarchy. Early segmentation-only approaches [7] were followed by approaches that also estimate labels [8], and by approaches that model similarity differently at different timescales [11]. Since the creation of the multi-scale SALAMI and INRIA annotations, approaches to hierarchical description have been refined [17], as has the methodology for evaluating them [18]. Hierarchy is partly a consequence of multiple sources behaving independently: three repetitions of the chorus could be considered the same at a coarse timescale (the context of the song), but differences in range or instrumentation could differentiate them at a finer timescale (the context of the three choruses). Models of hierarchy will always be ambiguous, since its perception is ambiguous [12]. In contrast, the multi-layered composition of a song can be described more concretely. Thus, multi-part pattern analysis is worth treating separately from hierarchical structure, and a good multi-part analysis may be very useful for describing hierarchy.

Figure 2. Algorithm and ground truth generation pipeline. (Diagram: the input audio file undergoes 1.A stereo-based source separation into left, center, and right tracks, then 1.B harmonic-percussive source separation, giving six tracks; RMS activation functions are binarized with 2. a k-means threshold; recurrence plots are 3. thresholded to minimize transitivity errors, yielding segment labels and a multi-track estimated description. In parallel, the MIDI file's piano rolls (N channels) and ground-truth downbeats yield a multi-channel ground truth description, and the two descriptions are compared in the evaluation.)

Finally, two works have directly bridged source separation and structural analysis: First, [10] found that structure analysis could be performed more accurately with multi-track audio as input. Second, [27] discovered that spikes in the reconstruction error of a source separation algorithm can indicate structural boundaries. In defining the task of multi-part pattern description, we hope to bring these fields closer together.

3. PROPOSED APPROACH

Our proposed algorithm is outlined in Fig. 2, and data at certain intermediate steps are illustrated in Fig. 3. The three key stages of the algorithm are:

1. Source separation. We apply source separation to the audio to convert the stereo recording to an estimated multi-track recording. We do this with two median spectral filters [13]: first, we take the median of the left and right channels to estimate the center channel, and subtract this from the original signals, resulting in three tracks. Second, we apply HPSS to each track [6]. Even if a track contains multiple pitched instruments, HPSS can separate instruments with different attacks, such as piano vs. strings, or rhythm guitar vs. organ. We end up with 6 audio tracks (see Fig. 3b).
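As a concrete illustration, the sketch below implements a simplified version of this stage using librosa. The paper's actual filters follow the kernel additive framework of [13], so the function name, the FFT parameters, and the soft-mask subtraction here are our own illustrative choices, not the authors' implementation.

```python
# A minimal sketch of stage 1 using librosa; parameters and the
# soft-mask subtraction are illustrative, not the paper's exact method.
import numpy as np
import librosa

def separate_six_tracks(path, n_fft=4096, hop=1024):
    y, sr = librosa.load(path, sr=None, mono=False)   # y has shape (2, n)
    specs = [librosa.stft(ch, n_fft=n_fft, hop_length=hop) for ch in y]
    mags = np.abs(np.stack(specs))                    # (2, bins, frames)
    # Median across the two channels estimates the center magnitude
    # (for two channels the median equals the mean); subtract it from
    # each side to get the left/right residuals.
    center_mag = np.median(mags, axis=0)
    center = center_mag * np.exp(1j * np.angle(specs[0] + specs[1]))
    sides = [np.maximum(m - center_mag, 0.0) * np.exp(1j * np.angle(s))
             for m, s in zip(mags, specs)]
    tracks = []
    for stft in (sides[0], center, sides[1]):         # left, center, right
        harm, perc = librosa.decompose.hpss(stft)     # second median filter
        tracks += [librosa.istft(harm, hop_length=hop),
                   librosa.istft(perc, hop_length=hop)]
    return tracks, sr                                 # six mono signals
```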

2. Activation function estimation. The separated tracks may be sparse: e.g., if the left channel contains only strings, the left-percussive component may be nearly empty. We compute RMS to estimate when the channels are active. At this stage, we also use the ground truth downbeat labels to define our segment windows. In future work, beats and downbeats could be estimated instead.

Figure 3. Data in intermediate stages of the algorithm pipeline: (a) original audio; (b) source-separated audio; (c) estimated activation functions; (d) center-harmonic SSM; (e) thresholded SSM; (f) estimated description. The SSMs in plots (d) and (e) correspond to the center-harmonic track, which is the first track in plots (b) and (c). Sound example is "Across The Universe."

We take the mean over each window (i.e., each measure), and apply a k-means clustering to the RMS values, with k = 2, to classify windows as either silent or active. Even if the classes are very uneven, the difference between the two with respect to RMS tends to be extreme enough that this method is effective. At the end of this stage we have a set of estimated activation functions (see Fig. 3c).
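A minimal sketch of this stage follows, assuming the measure boundaries (ground-truth downbeats, as sample indices) are given; scikit-learn's KMeans stands in for whichever clustering implementation the authors used.

```python
# Sketch of stage 2: label each measure of a track silent or active by
# 2-means clustering of per-measure RMS. `downbeat_samples` (measure
# boundaries in sample indices) is assumed given, as in the paper.
import numpy as np
from sklearn.cluster import KMeans

def activation_function(track, downbeat_samples):
    rms = np.array([np.sqrt(np.mean(track[a:b] ** 2) + 1e-12)
                    for a, b in zip(downbeat_samples[:-1],
                                    downbeat_samples[1:])])
    km = KMeans(n_clusters=2, n_init=10).fit(rms.reshape(-1, 1))
    active = np.argmax(km.cluster_centers_.ravel())   # the louder cluster
    return (km.labels_ == active).astype(int)         # 1 = active measure
```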

3. Sequence analysis. We use self-similarity matrices (SSMs) to discover repetitions in each track. We compute chroma with the madmom package [4] and compute a measure-indexed SSM: element (i, j) gives the cosine similarity between the sequences of beat-synchronized chroma features of the ith and jth measures. We also use the previously estimated activation functions to zero out the SSM when the track was judged inactive, as shown at the beginning and end of the track in Fig. 3d.
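A sketch of this computation is below. The paper computes chroma with madmom [4]; librosa's chroma_cqt is used here as a stand-in, and resampling each measure to a fixed 16 frames is our own simplification of the beat-synchronized comparison.

```python
# Sketch of the measure-indexed SSM; librosa's chroma_cqt stands in for
# madmom chroma, and the 16-frame resampling per measure is a
# simplification of the beat-synchronized features.
import numpy as np
import librosa

def measure_ssm(track, sr, downbeat_samples, activation, hop=512):
    chroma = librosa.feature.chroma_cqt(y=track, sr=sr, hop_length=hop)
    bounds = [s // hop for s in downbeat_samples]
    seqs = []
    for a, b in zip(bounds[:-1], bounds[1:]):
        idx = np.linspace(a, max(b - 1, a), num=16).astype(int)
        seqs.append(chroma[:, idx].ravel())   # one vector per measure
    X = np.array(seqs)
    X /= np.linalg.norm(X, axis=1, keepdims=True) + 1e-12
    S = X @ X.T                               # cosine similarity
    S *= np.outer(activation, activation)     # zero out inactive measures
    return S
```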

To estimate segment labels from the real-valued SSM, we choose a threshold t to binarize the matrix; then, to emphasize diagonal lines, we apply a single erosion-dilation operation (in time-lag space) with a kernel size k. We choose t and k in a novel way: by finding the values that minimize the number of transitivity errors. These errors are resolved with a novel lexical-sort approach. Transitivity errors are cases where a segment i is judged to be similar to both j and k, but segments j and k are not similar to each other; resolving these inconsistencies is a difficult part of interpreting structure from SSMs (e.g., see [20]).

Figure 4. Eliminating transitivity errors with lexical sorting. Errors appear as inconsistent blocks in the sorted SSM. We can fix the error by eliminating pixels X, or pixels Y, or adding pixels at Z. (Diagram: an 8×8 binary SSM has its columns and rows sorted into the order 1 4 2 5 8 3 6 7, the error pixels X, Y, and Z are identified and the errors deleted, and the rows and columns are then restored to their original order.)

Given a binary SSM, we can collect repeating pixels into groups by sorting the rows lexically (i.e., alphabetically). The process is illustrated in Fig. 4: after sorting the SSM's rows and columns, groups of repeating elements become blocks on the main diagonal, and all other pixels represent transitivity errors. The third SSM in Fig. 4 can be fixed in three ways: zeroing the pixels at X, or at Y, or adding pixels at Z. We greedily eliminate the errors by walking along the main diagonal from the upper left and discarding off-diagonal elements that do not fit the current block, which corresponds to zeroing X. When the cleaned SSM is re-ordered, the result is guaranteed to be transitive.

We call the number of pixels deleted from an SSM the "strain", and the number of off-diagonal pixels that remain the "coverage". (For the example in Fig. 4, strain is 2 and coverage is 4.) Our goal is to choose t and k to maximize coverage and minimize strain, while avoiding degenerate cases such as an empty SSM or an SSM that is all ones.
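The sketch below shows one way to implement the lexical-sort cleanup and the strain/coverage counts. It is our reading of the procedure, and it counts symmetric pixel pairs twice, so its raw numbers may differ from the convention used for the Fig. 4 example.

```python
# Sketch of the lexical-sort cleanup (see Fig. 4): sort rows and
# columns so repetition groups form diagonal blocks, greedily zero the
# off-block pixels (the "X" pixels), and report strain and coverage.
import numpy as np

def clean_ssm(S):
    n = S.shape[0]
    order = np.lexsort(S.T[::-1])        # row order, column 0 as primary key
    T = S[np.ix_(order, order)].astype(int)
    before = T.sum()
    i = 0
    while i < n:
        j = i
        while j + 1 < n and T[j + 1, i]: # grow block while linked to anchor
            j += 1
        T[i:j + 1, :i] = 0               # discard pixels outside the block
        T[i:j + 1, j + 1:] = 0
        T[:i, i:j + 1] = 0               # keep the matrix symmetric
        T[j + 1:, i:j + 1] = 0
        i = j + 1
    strain = before - T.sum()            # pixels deleted
    coverage = T.sum() - np.trace(T)     # off-diagonal pixels remaining
    inv = np.argsort(order)              # restore original row/column order
    return T[np.ix_(inv, inv)], strain, coverage
```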

We sweep values of t between 0.99 and 0.8, and k between 4 and 8 measures. A set of real-world examples is shown in Fig. 5. The left column contains 5 binary SSMs derived from chroma computed on an audio track. The second and third columns give the lexically ordered SSMs (and their strains) and their cleaned versions (and coverages). The fourth column gives the cleaned SSMs and the difference between coverage and strain, which is maximized by choosing k = 7. The result is a binary SSM that is sparse but not empty, and free of transitivity errors, as in Fig. 3e. It is then trivial to label the segments. The six estimated part descriptions are collected in Fig. 3f.
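The parameter sweep can then be sketched as follows, reusing clean_ssm from the previous sketch. The diagonal structuring element stands in for the paper's erosion-dilation in time-lag space, and the step sizes are illustrative assumptions.

```python
# Sketch of the t/k sweep: binarize, apply one erosion-dilation with a
# diagonal structuring element (a stand-in for time-lag filtering),
# and keep the setting that maximizes coverage minus strain.
import numpy as np
from scipy import ndimage

def best_binary_ssm(S_real):
    best, best_score = None, -np.inf
    for t in np.arange(0.80, 0.995, 0.01):
        B = (S_real >= t).astype(int)
        for k in range(4, 9):                      # kernel sizes 4..8
            kernel = np.eye(k, dtype=bool)
            filt = ndimage.binary_dilation(
                ndimage.binary_erosion(B, structure=kernel),
                structure=kernel).astype(int)
            cleaned, strain, coverage = clean_ssm(filt)
            score = coverage - strain
            # Skip degenerate candidates: empty or all-ones matrices.
            if 0 < cleaned.sum() and cleaned.mean() < 1 and score > best_score:
                best, best_score = cleaned, score
    return best
```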

In structure analysis, we typically search for long repeating subsequences and long homogeneous stretches, and apply strong smoothing to the SSM to gloss over variations. In contrast, the above pipeline was designed to focus on tracking shorter patterns and to find when they repeat exactly, with the expectation of obtaining a much sparser SSM with few transitivity errors.

4. EVALUATION FRAMEWORK

4.1 Data and Ground Truth

To test the quality of a multi-part pattern analysis algorithm, we need audio files with multiple layers, with each layer annotated to indicate repeating patterns. Creating ground truth for this task by hand would be infeasible for a large corpus. There are many public datasets of multi-track audio, but only rarely are the tracks annotated in detail. The existing dataset that most closely meets our needs is MedleyDB [3], which contains multitrack audio and melody f0 annotations for a subset of stems, but not annotations of repetitions in each track.

Figure 5. Illustration of the strain-coverage optimization approach on a track estimated from a recording of "All My Loving." The four columns, from left, give: (1) binary SSMs filtered with different kernel sizes; (2) lexically sorted SSMs, each annotated with its strain; (3) SSMs with errors removed, each annotated with its coverage; (4) cleaned SSMs restored to original column and row order, each annotated with its coverage-minus-strain difference. Kernel size 7 maximizes coverage while minimizing strain.

However, we can generate an appropriate dataset from MIDI. We used a portion of the Lakh MIDI dataset [23] called the "Clean MIDI subset", which contains most of the Beatles catalogue, and used FluidSynth¹ to convert these to audio files. When there were duplicate MIDI files to choose from, we selected the version where the average panning setting of the tracks had the highest standard deviation. (Many MIDI transcriptions have no panning information at all, which would work against our algorithm.)

¹ http://www.fluidsynth.org/

We processed the MIDI files (using Pretty MIDI [24]) to create, for each MIDI channel, a ground truth description of the measure-level patterns. The procedure for this is similar to our analysis algorithm (see Fig. 3). From a downbeat-segmented piano roll (Fig. 6a), we obtain an activation pattern, i.e., a timeline of 1s and 0s indicating whether an instrument has any MIDI note events during each measure-long window (Fig. 6b). Next, we estimate the similarity of every pair of measures within a track with an SSM (Fig. 6c). To compare two piano roll windows, we take the percentage of active note spans that overlap. To focus on exact repetitions, we should use a threshold of 1.0, but in practice, due to small timing differences and expressive gestures in the MIDI transcription, a threshold of 1.0 leads to extremely sparse recurrence plots; on the other hand, lowering the threshold can lead to transitivity errors, as before. We found that a threshold of 0.9 was generally suitable to obtain non-empty recurrence plots without producing a large number of transitivity errors (Fig. 6d). Doing this for every track gives a multi-part description (Fig. 6e).

Figure 6. Data in intermediate stages of ground truth generation for the vocal channel of "All My Loving." The song's multi-part description is shown in (e).
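A sketch of this procedure with pretty_midi is shown below. The intersection-over-union comparison of binarized piano-roll windows is our approximation of the paper's "percentage of active note spans that overlap"; the sampling rate fs is an illustrative choice.

```python
# Sketch of the ground-truth generation with pretty_midi. Binarized
# piano-roll windows are compared per measure with intersection-over-
# union (our approximation of the paper's overlap percentage),
# thresholded at 0.9.
import numpy as np
import pretty_midi

def channel_recurrence_plots(midi_path, fs=16, thresh=0.9):
    pm = pretty_midi.PrettyMIDI(midi_path)
    frames = (pm.get_downbeats() * fs).astype(int)   # measure boundaries
    plots = []
    for inst in pm.instruments:
        roll = inst.get_piano_roll(fs=fs) > 0
        wins = [roll[:, a:b] for a, b in zip(frames[:-1], frames[1:])]
        n = len(wins)
        R = np.zeros((n, n), dtype=int)
        for i in range(n):
            for j in range(i, n):
                w = min(wins[i].shape[1], wins[j].shape[1])
                a, b = wins[i][:, :w], wins[j][:, :w]
                union = np.logical_or(a, b).sum()
                inter = np.logical_and(a, b).sum()
                R[i, j] = R[j, i] = int(union > 0 and inter / union >= thresh)
        plots.append(R)
    return plots
```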

4.2 Evaluation Metrics

After processing the MIDI data, we obtain a "ground truth" matrix of instrument patterns A, where the element A_{i,j} indicates the pattern label for the ith instrument during the jth measure. (Such information is displayed in Fig. 1 and Fig. 6e.) We set A_{i,j} = 0 when the ith instrument is not active. Similarly, we can obtain an estimated description E with elements E_{i,j}, such as in Fig. 3f.

To compare two single-track descriptions (two rows of A and E), we can use any metrics from the field of structure analysis, such as the pairwise f-measure metric [19]. (For a comparison of structure evaluation metrics, see [16].) However, the rows of A and E are not necessarily aligned in the correct order. Moreover, the number of estimated tracks in E may be smaller or greater than the number of MIDI channels in A. We present two sets of evaluation metrics: one that requires matching the layers, and one that does not. We also devise a set of baselines.

Evaluation of descriptions. Suppose we have an N-layer estimated description and an M-layer ground truth description, and let L = min(M, N). We can compute the pairwise f-measure between all pairs of layers, giving a table of values like the one in Fig. 7. The Hungarian algorithm² gives us the optimal one-to-one matching of L layers that maximizes the average f-measure. Using the optimal pairings, we can compute the average precision and recall. Together, these serve as our set of three "generous" metrics, since they do not punish cases where N ≠ M. If there are unmatched layers, whether in A or E, these should count against the estimate. We can compute a stricter mean f-measure by taking pwf · L / max(M, N).

Figure 7. Optimal pairing of estimated tracks (columns) with ground truth channels (rows), according to pairwise f-measure between descriptions (which are illustrated as recurrence plots). The mean pwf is 0.44.

² https://en.wikipedia.org/wiki/Hungarian_algorithm
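Concretely, given the table of pairwise f-measures, the matching and both scores can be computed as in this sketch; scipy's linear_sum_assignment implements the Hungarian algorithm, and the function name is our own.

```python
# Sketch of the layer matching: given the M x N table of pairwise
# f-measures (as in Fig. 7), find the optimal pairing and compute the
# generous and strict mean f-measures.
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_layers(f_table):
    M, N = f_table.shape
    rows, cols = linear_sum_assignment(f_table, maximize=True)
    generous_f = f_table[rows, cols].mean()        # over L = min(M, N) pairs
    strict_f = generous_f * min(M, N) / max(M, N)  # punish unmatched layers
    return list(zip(rows, cols)), generous_f, strict_f
```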

Evaluation of activations. The activation matrix that we estimate (e.g., Fig. 3c) is an important intermediate step. It is worth evaluating on its own, and we can do so without matching tracks to channels. We treat each column of the activation matrix, an N-dimensional binary vector, as a 'timbre label,' such that each unique column gets a unique label. (This calls to mind the timbre-mixture estimation of [1].) We perform the same process on the ground truth activation matrix. Then we can use pairwise f-measure to compare the two sequences of timbre labels.
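The following sketch shows the timbre-label construction, together with a direct pairwise precision/recall/f computation over label sequences (our own implementation, restricted to pairs i < j so that the main diagonal is excluded, per the discussion in Section 5).

```python
# Sketch of the activation evaluation: distinct columns of the binary
# activation matrix become 'timbre labels', and two label sequences
# are compared with pairwise precision, recall, and f-measure.
import numpy as np

def timbre_labels(activations):                 # (n_layers, n_measures)
    _, labels = np.unique(activations, axis=1, return_inverse=True)
    return labels

def pairwise_prf(ref_labels, est_labels):
    r, e = np.asarray(ref_labels), np.asarray(est_labels)
    iu = np.triu_indices(len(r), k=1)           # pairs i < j only
    ref_same = (r[:, None] == r[None, :])[iu]
    est_same = (e[:, None] == e[None, :])[iu]
    hits = np.logical_and(ref_same, est_same).sum()
    p = hits / max(est_same.sum(), 1)
    rec = hits / max(ref_same.sum(), 1)
    return p, rec, 2 * p * rec / max(p + rec, 1e-12)
```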

This metric ignores the difference between an instrument being added to or subtracted from the mixture. To evaluate the retrieval of entrances and exits, we use a version of the boundary retrieval f-measure [19, p. 220], counting each entrance (or exit) in the ground truth as being correctly estimated only if some instrument in the estimated description also enters (or exits, respectively) in the same measure.
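A sketch of this entrance/exit matching follows; it treats an event as correct only when some estimated instrument changes state the same way in the same measure, and it collapses simultaneous events within one matrix into a single event.

```python
# Sketch of the entrance/exit score over binary activation matrices
# of shape (layers, measures).
import numpy as np

def events(act):
    d = np.diff(act, axis=1)
    enters = np.where((d == 1).any(axis=0))[0] + 1   # measure of each entrance
    exits = np.where((d == -1).any(axis=0))[0] + 1   # measure of each exit
    return {("enter", m) for m in enters} | {("exit", m) for m in exits}

def entrance_exit_prf(ref_act, est_act):
    R, E = events(ref_act), events(est_act)
    hits = len(R & E)
    p = hits / max(len(E), 1)
    r = hits / max(len(R), 1)
    return p, r, 2 * p * r / max(p + r, 1e-12)
```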

4.3 Baseline Methods

We compare our algorithm against a set of naive baselines, both to gauge the success of our algorithm and to learn how the proposed evaluation metrics behave. The labeling baselines, sketched in code after the list below, are:

• Bconstant: all measures repeat the same pattern;

• Bnull: all measures are unique;

• Bperiodic: there are three concurrent tracks playing sequence loops of length 2, 4 and 8 measures: i.e., three sequences [ab]* (i.e., ab repeated), [abcd]*, and [abcdefgh]*;

• Bblock: there are three concurrent tracks that alternate static textures with periods 2, 4 and 8 measures: i.e., [ab]*, [aabb]*, and [aaaabbbb]*.
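A minimal sketch generating these four labeling baselines for a song of n measures; the function names are our own, and each returns a (layers × measures) matrix of integer pattern labels.

```python
# Sketch of the labeling baselines; each row is one concurrent track.
import numpy as np

def b_constant(n):                  # every measure repeats one pattern
    return np.zeros((1, n), dtype=int)

def b_null(n):                      # every measure is unique
    return np.arange(n, dtype=int)[None, :]

def b_periodic(n):                  # loops [ab]*, [abcd]*, [abcdefgh]*
    return np.stack([np.arange(n) % p for p in (2, 4, 8)])

def b_block(n):                     # textures [ab]*, [aabb]*, [aaaabbbb]*
    return np.stack([(np.arange(n) // (p // 2)) % 2 for p in (2, 4, 8)])
```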

The activation matrix baselines are:

• Buniform: the song has a single texture;

• Bbuildup: new instruments enter in measures 3, 5 and 7.

In addition to these naive baselines, we tested two simplified versions of our proposed approach. The first skips the source separation step: instead of estimating patterns from 6 separated tracks, we estimate patterns from the full-audio chroma features, and then duplicate the result 6 times to match the number of estimated sources in the proposed approach ("Chr. w/o SS"). Second, since the activation matrix is evaluated as if it were a timbre label, we also estimate timbre labels by computing full-audio MFCCs and using NMF to label the measures ("MFCCs"). All section transitions are treated as predictions of entrances and exits.

5. RESULTS AND DISCUSSION

We applied all the approaches described above to the dataset of 200 Beatles songs. The results for the multi-part pattern description task are shown in Table 1 ("Standard approach"). We find that the proposed approach outperforms the naive baselines, but that the simpler approach that skips the source separation step performs even better, even though it has lower recall. The pwf values are almost all dominated by the lower precision values; as in structure analysis, it seems harder to achieve high precision than high recall. By tweaking the evaluation metric, we can understand why. In the bottom half of the table, we compute pwf counting the elements on the main diagonal. The Bnull baseline, which guesses that every measure is different, now becomes very competitive.

The explanation is that, unlike in the usual structure analysis task, the ground truth for this task is very sparse. Recall that pairwise f-measure tells us how well the similarity relationships of one description are captured by the similarity relationships in another. In other words, given two binary SSMs that encode similarity descriptions, pwf assesses how well the positive parts of these SSMs coincide. Since it is trivial to guess that each segment is similar to itself, we should ignore the contributions of the main diagonal. This does not usually affect the outcome of structure evaluation, since the repeating blocks ensure that the ground truth SSM has very many off-diagonal pixels to estimate. However, in our application, the ground truth matrices are extremely sparse: in cases where a part never repeats itself exactly, there are no off-diagonal elements.

                 Standard approach
              Generous                Strict
              pwf     pwp     pwr     pwf
Proposed      .245    .211    .71     .184
Chr. w/o SS   .297    .312    .529    .222
Bconstant     .144    .092    .95     .106
Bnull         0       0       0       0
Bperiodic     .149    .136    .318    .112
Bblock        .06     .103    .074    .044

         Counting self-labeled measures
              pwf     pwp     pwr     pwf
Proposed      .365    .309    .819    .274
Chr. w/o SS   .466    .442    .695    .346
Bconstant     .183    .115    .962    .135
Bnull         .515    .749    .477    .384
Bperiodic     .255    .218    .461    .192
Bblock        .411    .44     .465    .309

Table 1. Above: results for estimated multi-track description quality using the proposed metric. Below: the results if self-labeled measures are counted as correct. The high retrieval for Bnull illustrates the sparseness of the ground truth.

On the other hand, this task is unlike structure analysis because in our case, elements on the main diagonal can equal 0, if the corresponding source is not active. This means that the Bnull baseline does not actually achieve perfect precision: from the bottom part of Table 1 we can see that on average, sources are active for 75% of the song.

Results for the activation detection task are shown in Table 2. According to the pwf measure, the best approach to characterize the changing timbre of the piece was our proposed one. However, the uniform baseline performs almost as well according to this metric. Although some songs have over a dozen tracks, with many entrances and exits, it seems that the majority of songs have an instrumentation that changes little. As a result, the uniform and buildup baselines achieve near-perfect recall while precision does not fall below 30%. That said, these naive baselines fail to detect nearly all the entrances and exits of instruments from the mixture, so the proposed approach beats them handily on entrance/exit f-measure.

In contrast, the MFCC approach tends to find a majority of the entrances and exits, and narrowly beats the proposed approach in terms of entrance/exit f-measure. The cost of this apparent over-segmentation is lower pairwise retrieval, and the lowest overall pwf, for labeling the timbres.

              Timbre labeling         Entrance/exit
              pwf     pwp     pwr     f       p       r
Proposed      .450    .456    .546    .248    .271    .296
MFCCs         .3      .549    .319    .273    .195    .566
Buniform      .433    .306    .962    0       0       0
Bbuildup      .446    .328    .909    .071    .351    .045

Table 2. Results for estimated activation matrix quality.

In designing the evaluation, we made an effort to re-use metrics that are used for structure analysis. We did not expect the sparseness of the ground truth to have such an impact on the metrics, but the impact is plain to see in the success of the baselines. Perhaps we should have anticipated this: data sparseness is often a problem when translating a one-dimensional function (here, the overall structure) into a higher-dimensional space (a per-channel representation). One potential way to resolve this issue is to automatically process both the ground truth and the estimated descriptions using a fixed sequences-to-blocks conversion step, such as that proposed by [9]. This would allow us to compare nearly-equivalent representations that are much less sparse.

Needless to say, the multi-track analysis approach we have proposed could be improved in many ways. We have used two source separation kernels, in a fixed way, but it is possible to apply more kernels, and to do so in an optimization framework to increase the independence of the estimated tracks [13]. Future work should also test a greater variety of source separation methods, especially NMF-based approaches. However, this first effort has helped us to understand the special challenge of this task, which is the sparseness of the ground truth.

6. CONCLUSION AND FUTURE WORK

We have described a new MIR task, multi-part pattern analysis, in which the goal is to describe each independent layer of a piece of music. The task complements recent work on estimating hierarchical structure. We have also proposed a method for estimating multi-part pattern analyses using a combination of existing source-separation tools, SSM-based structure estimation methods, and a novel approach to thresholding SSMs in order to minimize transitivity errors.

To support future work on this problem, we have proposed a method of creating ground truth annotations from MIDI files, and a set of evaluation metrics that can estimate the similarity between two multi-part descriptions or two multi-part activation functions.

In our evaluation, we found the sparseness of the data to be an issue, but it is a direct consequence of how we chose to create the ground truth. As we refine the methodology for this task in future work, we will study the impact of different ways of converting multi-channel files into ground truth recurrence plots.

7. ACKNOWLEDGEMENTS

This work was supported in part by JST ACCEL Grant Number JPMJAC1602, Japan.


8. REFERENCES

[1] J.-J. Aucouturier and Mark Sandler. Segmentation of musical signals using hidden Markov models. In Proc. of the Audio Engineering Society Convention, Amsterdam, The Netherlands, 2001.

[2] Frédéric Bimbot, Gabriel Sargent, Emmanuel Deruty, Corentin Guichaoua, and Emmanuel Vincent. Semiotic description of music structure: An introduction to the Quaero/Metiss structural annotations. In Proc. of the AES Conference on Semantic Audio, 2014.

[3] Rachel M. Bittner, Justin Salamon, Mike Tierney, Matthias Mauch, Chris Cannam, and Juan Pablo Bello. MedleyDB: A multitrack dataset for annotation-intensive MIR research. In Proc. of ISMIR, pages 155–160, Taipei, Taiwan, 2014.

[4] Sebastian Böck, Filip Korzeniowski, Jan Schlüter, Florian Krebs, and Gerhard Widmer. madmom: A new Python audio and music signal processing library. In Proc. of the ACM International Conference on Multimedia, pages 1174–1178, Amsterdam, The Netherlands, 2016.

[5] Tom Collins, Andreas Arzt, Sebastian Flossmann, and Gerhard Widmer. SIARCT-CFP: Improving precision and the discovery of inexact musical patterns in point-set representations. In Proc. of ISMIR, pages 549–554, Curitiba, Brazil, 2013.

[6] Derry FitzGerald. Harmonic/percussive separation using median filtering and amplitude discrimination. In Proc. of the International Conference on Digital Audio Effects, Graz, Austria, September 2010.

[7] Jonathan Foote. Automatic audio segmentation using a measure of audio novelty. In Proc. of the IEEE International Conference on Multimedia & Expo, pages 452–455, 2000.

[8] Jonathan Foote and Matthew Cooper. Media segmentation using self-similarity decomposition. In Minerva Yeung, Rainer Lienhart, and Chung-Sheng Li, editors, Proc. of the SPIE: Storage and Retrieval for Media Databases, volume 5021, pages 167–175, Santa Clara, CA, USA, 2003.

[9] Harald Grohganz, Michael Clausen, Nanzhu Jiang, and Meinard Müller. Converting path structures into block structures using eigenvalue decompositions of self-similarity matrices. In Proc. of ISMIR, 2013.

[10] Steven Hargreaves, Anssi Klapuri, and Mark Sandler. Structural segmentation of multitrack audio. IEEE Transactions on Audio, Speech, and Language Processing, 20(10):2637–2647, 2012.

[11] Tristan Jehan. Hierarchical multi-class self similarities. In Proc. of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, pages 311–314, New Paltz, NY, USA, 2005.

[12] Fred Lerdahl and Ray S. Jackendoff. A Generative Theory of Tonal Music. MIT Press, 1983.

[13] Antoine Liutkus, Derry FitzGerald, Zafar Rafii, Bryan Pardo, and Laurent Daudet. Kernel additive models for source separation. IEEE Transactions on Signal Processing, 62(16):4298–4310, 2014.

[14] Antoine Liutkus, Zafar Rafii, Roland Badeau, Bryan Pardo, and Gaël Richard. Adaptive filtering for music/voice separation exploiting the repeating musical structure. In Proc. of the IEEE International Conference on Acoustics, Speech and Signal Processing, pages 53–56, Kyoto, Japan, 2012.

[15] Patricio López-Serrano, Christian Dittmar, Jonathan Driedger, and Meinard Müller. Towards modeling and decomposing loop-based electronic music. In Proc. of ISMIR, pages 502–508, New York, NY, USA, 2016.

[16] Hanna Lukashevich. Towards quantitative measures of evaluating song segmentation. In Proc. of ISMIR, pages 375–380, Philadelphia, PA, USA, 2008.

[17] Brian McFee and Daniel Ellis. Analyzing song structure with spectral clustering. In Proc. of ISMIR, pages 405–410, Taipei, Taiwan, 2014.

[18] Brian McFee, Oriol Nieto, and Juan Pablo Bello. Hierarchical evaluation of segment boundary detection. In Proc. of ISMIR, Málaga, Spain, 2015.

[19] Meinard Müller. Music structure analysis. In Fundamentals of Music Processing: Audio, Analysis, Algorithms, Applications, pages 167–236. Springer International Publishing, 2015.

[20] Meinard Müller, Nanzhu Jiang, and Peter Grosche. A robust fitness measure for capturing repetitions in music recordings with applications to audio thumbnailing. IEEE Transactions on Audio, Speech, and Language Processing, 21(3):531–543, 2013.

[21] Oriol Nieto and Morwaread M. Farbood. Identifying polyphonic patterns from audio recordings using music segmentation techniques. In Proc. of ISMIR, pages 411–416, Taipei, Taiwan, 2014.

[22] Geoffroy Peeters and Emmanuel Deruty. Is music structure annotation multi-dimensional? A proposal for robust local music annotation. In Proc. of the International Workshop on Learning the Semantics of Audio Signals, pages 75–90, Graz, Austria, 2009.

[23] Colin Raffel. Learning-Based Methods for Comparing Sequences, with Applications to Audio-to-MIDI Alignment and Matching. PhD thesis, Columbia University, 2016.

[24] Colin Raffel and Daniel P. W. Ellis. Intuitive analysis, creation and manipulation of MIDI data with pretty_midi. In ISMIR Late Breaking and Demo Papers, pages 84–93, 2014.


[25] Zafar Rafii, Antoine Liutkus, and Bryan Pardo. REPET for background/foreground separation in audio. In G. R. Naik and W. Wang, editors, Blind Source Separation, Signals and Communication Technology, pages 395–411. Springer-Verlag, 2014.

[26] Chris Sanden, Chad R. Befus, and John Z. Zhang. A perceptual study on music segmentation and genre classification. Journal of New Music Research, 41(3):277–293, 2012.

[27] Prem Seetharaman and Bryan Pardo. Simultaneous separation and segmentation in layered music. In Proc. of ISMIR, pages 495–501, New York, NY, USA, 2016.

[28] Paris Smaragdis. Non-negative matrix factor deconvolution: Extraction of multiple sound sources from monophonic inputs. In Independent Component Analysis and Blind Signal Separation, volume 3195 of Lecture Notes in Computer Science, pages 494–499. Springer-Verlag, Berlin, Heidelberg, 2004.

[29] Jordan B. L. Smith, J. Ashley Burgoyne, Ichiro Fujinaga, David De Roure, and J. Stephen Downie. Design and creation of a large-scale database of structural annotations. In Proc. of ISMIR, pages 555–560, Miami, FL, USA, 2011.
