
Vol. 48 No. 1 IPSJ Journal Jan. 2007

Regular Paper

Instrogram: Probabilistic Representation of Instrument Existence for Polyphonic Music

Tetsuro Kitahara,† Masataka Goto,†† Kazunori Komatani,† Tetsuya Ogata† and Hiroshi G. Okuno†

This paper presents a new technique for recognizing musical instruments in polyphonic music. Since conventional musical instrument recognition in polyphonic music is performed notewise, i.e., for each note, accurate estimation of the onset time and fundamental frequency (F0) of each note is required. However, these estimations are generally not easy in polyphonic music, and thus estimation errors severely deteriorate the recognition performance. Without these estimations, our technique calculates the temporal trajectory of instrument existence probabilities for every possible F0. The instrument existence probability is defined as the product of a nonspecific instrument existence probability calculated using the PreFEst and a conditional instrument existence probability calculated using hidden Markov models. The instrument existence probability is visualized as a spectrogram-like graphical representation called the instrogram and is applied to MPEG-7 annotation and instrumentation-similarity-based music information retrieval. Experimental results from both synthesized music and real performance recordings have shown that instrograms achieved MPEG-7 annotation (instrument identification) with a precision rate of 87.5% for synthesized music and 69.4% for real performances on average, and that the instrumentation similarity measure reflected the actual instrumentation better than an MFCC-based measure.

1. Introduction

The goal of our study is to enable users to retrieve musical pieces based on their instrumentation. The types of instruments that are used are important characteristics for retrieving musical pieces. In fact, the names of certain musical genres, such as "piano sonata" and "string quartet", are based on instrument names. There are two strategies for instrumentation-based music information retrieval (MIR). The first allows users to directly specify musical instruments (e.g., searching for a piano solo or string music). This strategy is useful because, unlike other musical elements such as chord progressions, specifying instruments does not require any special knowledge. The other strategy is Query-by-Example, where once users specify a musical piece that they like, the system searches for pieces that have similar instrumentation to the specified piece. This strategy is also particularly useful for automatically generating playlists for background music.

The key technology for achieving the above-mentioned MIR is to recognize musical instruments from audio signals.

† Department of Intelligence Science and Technology, Graduate School of Informatics, Kyoto University

†† National Institute of Advanced Industrial Science and Technology (AIST)

Although musical instrument recognition studies mainly dealt with solo musical sounds in the 1990s 1), the number of studies dealing with polyphonic music has been increasing in recent years. Kashino, et al. 2) developed a computational architecture for music scene analysis called OPTIMA, which recognizes musical notes and instruments based on the Bayesian probability network. They subsequently proposed a method that identifies the instrument playing each musical note based on template matching with template adaptation 3). Kinoshita, et al. 4) improved the robustness of OPTIMA to the overlapping of frequency components, which occurs when multiple instruments are played simultaneously, based on feature adaptation. Eggink, et al. 5) tackled this overlap problem with the missing feature theory. They subsequently dealt with the problem of identifying only the instrument playing the main melody, based on the assumption that the main melody's partials would suffer less from other sounds occurring simultaneously 6). Vincent, et al. 7) formulated both music transcription and instrument identification as a single optimization based on independent subspace analysis. Essid, et al. 8) achieved F0-estimation-less instrument recognition based on a priori knowledge about the instrumentation for ensembles. Kitahara, et al. 9) proposed techniques to solve the above-mentioned overlap



problem and to avoid musically unnatural errors using musical context.

The common feature of these studies, except for Refs. 7) and 8), is that instrument identification is performed for each frame or each note. In the former case 5),6), it is difficult to obtain a reasonable accuracy because temporal variations in spectra are important characteristics of musical instrument sounds. In the latter case 2)∼4),9), the identification system has to first estimate the onset time and fundamental frequency (F0) of musical notes and then extract the harmonic structure of each note based on the estimated onset time and F0. Therefore, the instrument identification suffers from errors of onset detection and F0 estimation. In fact, correct data for the onset times and F0s were introduced manually in the experiments reported in Refs. 3) and 9).

In this paper, we propose a new technique that recognizes musical instruments in polyphonic musical audio signals without using onset detection or F0 estimation as explicit and deterministic preprocesses. The key concept underlying our technique is to visualize the probability that the sound of each target instrument exists at each time and with each F0 as a spectrogram-like representation called an instrogram. This probability is defined as the product of two kinds of probabilities, called the nonspecific instrument existence probability and the conditional instrument existence probability, which are calculated using the PreFEst 10) and hidden Markov models, respectively. The advantage of our technique is that errors due to the calculation of one probability do not influence the calculation of the other probability because the two probabilities can be calculated independently.

In addition, we describe the application of the instrogram technique to MPEG-7 annotation and MIR based on instrumentation similarity. To achieve the annotation, we introduced two kinds of tags to the MPEG-7 standard. The first is designed to describe the probabilities directly, and the second is designed to obtain a symbolic representation such as an event whereby a piano sound occurs at time t0 and continues until t1. Such a representation can be obtained using the Viterbi search on a Markov chain with states that correspond to instruments. To achieve the MIR based on instrumentation similarity, the distance (dissimilarity) between two instrograms is calculated

using dynamic time warping (DTW). A simple prototype system of similarity-based MIR is also achieved based on our instrumentation-based similarity measure.

2. Instrogram

The instrogram is a spectrogram-like graphical representation of a musical audio signal, which is useful for determining which instruments are used in the signal. In its basic format, an instrogram corresponds to a specific instrument. The instrogram has horizontal and vertical axes representing time and frequency, and the intensity of the color of each point (t, f) shows the probability p(ωi; t, f) that the target instrument ωi is used at time t and at an F0 of f. An example is presented in Fig. 1. This example shows the results of analyzing an audio signal of "Auld Lang Syne" played on the piano, violin, and flute. The target instruments for analysis were the piano, violin, clarinet, and flute. If the instrogram is too detailed for some purposes, it can be simplified by dividing the entire frequency region into a number of subregions and merging the results within each subregion. A simplified version of Fig. 1 is given in Fig. 2. The original or simplified instrogram shows that the melodies in the high (approx. note numbers 70–80), middle (60–75), and low (45–60) pitch regions are played on the flute, violin, and piano, respectively.

Fig. 1 Example of the instrogram.

Color versions of all figures are available at: http://winnie.kuis.kyoto-u.ac.jp/˜kitahara/instrogram/IPSJ07/


Fig. 2 Simplified (summarized) instrogram for Fig. 1.

3. Algorithm for Calculating Instrogram

Let Ω = {ω1, ..., ωm} be the set of target instruments. We then have to calculate the probability p(ωi; t, f) that a sound of the instrument ωi with an F0 of f exists at time t, for every target instrument ωi ∈ Ω. This probability is called the instrument existence probability (IEP). Here, we assume that multiple instruments are not played at the same time and at the same F0, that is, ∀ωi, ωj ∈ Ω: i ≠ j ⟹ p(ωi ∩ ωj; t, f) = 0. Let ω0 denote the silence event, which means that no instruments are being played, and let Ω+ = Ω ∪ {ω0}. The IEP then satisfies

    Σ_{ωi ∈ Ω+} p(ωi; t, f) = 1.

When the symbol X denotes the union event of all target instruments, which stands for the existence of some instrument (i.e., X = ω1 ∪ ··· ∪ ωm), the IEP for each ωi ∈ Ω can be calculated as the product of two probabilities:

    p(ωi; t, f) = p(X; t, f) p(ωi|X; t, f),

because ωi ∩ X = ωi ∩ (ω1 ∪ ··· ∪ ωi ∪ ··· ∪ ωm) = ωi. Above, p(X; t, f), called the nonspecific instrument existence probability (NIEP), is the probability that the sound of some instrument with an F0 of f exists at time t, while p(ωi|X; t, f), called the conditional instrument existence probability (CIEP), is the conditional probability that, if the sound of some instrument with an F0 of f exists at time t, the instrument is ωi. The probability p(ω0; t, f) is given by

    p(ω0; t, f) = 1 − Σ_{ωi ∈ Ω} p(ωi; t, f).
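The following minimal sketch (not the authors' implementation) illustrates how the two probabilities combine on a discrete time-frequency grid; the array names and shapes are our assumptions.

    import numpy as np

    def instrument_existence_probability(niep, ciep):
        """Combine the two constituent probabilities on a time-frequency grid.

        niep : array of shape (T, F) holding p(X; t, f), e.g., PreFEst weights.
        ciep : array of shape (T, F, m) holding p(omega_i | X; t, f) for m instruments.
        Returns p(omega_i; t, f) of shape (T, F, m) and the silence probability
        p(omega_0; t, f) = 1 - sum_i p(omega_i; t, f).
        """
        iep = niep[..., None] * ciep          # p(omega_i) = p(X) * p(omega_i | X)
        silence = 1.0 - iep.sum(axis=-1)      # p(omega_0; t, f)
        return iep, silence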

3.1 Overview

Figure 3 shows an overview of the algorithm for calculating an instrogram. Given an audio signal, the spectrogram is first calculated. The short-time Fourier transform (STFT) shifted by 10 ms (441 points at 44.1 kHz sampling) with an 8,192-point Hamming window is used in the current implementation.

Fig. 3 Overview of our technique for calculating the instrogram.
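As a concrete illustration of the analysis settings stated above (44.1 kHz input, an 8,192-point Hamming window, and a 441-sample hop), the following sketch computes the power spectrogram with SciPy; it is not the authors' implementation, and the function name is ours.

    import numpy as np
    from scipy.signal import stft

    def power_spectrogram(signal, fs=44100, n_fft=8192, hop=441):
        """STFT power spectrogram: 8,192-point Hamming window shifted by 10 ms."""
        f, t, Z = stft(signal, fs=fs, window="hamming", nperseg=n_fft,
                       noverlap=n_fft - hop, boundary=None)
        return f, t, np.abs(Z) ** 2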

We next calculate the NIEPs and CIEPs. The NIEPs are calculated using the PreFEst 10). The PreFEst models the spectrum of a signal containing multiple musical instrument sounds as a weighted mixture of harmonic-structure tone models at each frame.


For CIEPs, on the other hand, the temporal trajectory of the harmonic structure with every F0 is modeled with L-to-R HMMs because the temporal characteristics are important in recognizing instruments. Once the time series of feature vectors are obtained for every F0, the likelihoods of the paths in the chain of HMMs are calculated.

The advantage of this technique lies in the fact that p(ωi; t, f) can be estimated robustly because the two constituent probabilities are calculated independently and are then integrated by multiplication. In most previous studies, the onset time and F0 of each note were first estimated, and then the instrument for the note was identified by analyzing spectral components extracted based on the results of the note estimation. The upper limit of the instrument identification performance was therefore bound by the preceding note estimation, which is generally difficult and not robust for polyphonic music. (We tested robustness with respect to onset errors when identifying an instrument for every note using our previous method 9); giving the onset times errors that follow a normal distribution with a standard deviation of e [s], the recognition rate dropped from 71.4% at e = 0 to 69.2%, 66.7%, 62.5%, and 60.5% at e = 0.05, 0.10, 0.15, and 0.20, respectively.) Unlike such a notewise symbolic approach, our non-symbolic and non-sequential approach is more robust for polyphonic music.

3.2 Nonspecific Instrument Existence Probability

The NIEP p(X; t, f) is estimated by using the PreFEst on the basis of maximum likelihood estimation without assuming the number of sound sources in a mixture. The PreFEst, which was originally developed for estimating the F0s of melody and bass lines, consists of three processes: the PreFEst-front-end for frequency analysis, the PreFEst-core for estimating the relative dominance of every possible F0, and the PreFEst-back-end for evaluating the temporal continuity of the F0. Because the problem to be solved here is not the estimation of the predominant F0s as melody and bass lines, but rather the calculation of p(X; t, f) for every possible F0, we use only the PreFEst-core.

The PreFEst-core models an observed power spectrum as a weighted mixture of tone models p(x|F) for every possible F0 F. The tone model p(x|F), where x is the log frequency, represents a typical spectrum of harmonic structures, and the mixture density p(x; θ(t)) is defined as

    p(x; θ(t)) = ∫_{Fl}^{Fh} w(t)(F) p(x|F) dF,
    θ(t) = { w(t)(F) | Fl ≤ F ≤ Fh },

where Fl and Fh denote the lower and upper limits, respectively, of the possible F0 range, and w(t)(F) is the weight of a tone model p(x|F) that satisfies ∫_{Fl}^{Fh} w(t)(F) dF = 1. If we can estimate the model parameter θ(t) such that the observed spectrum is likely to have been generated from p(x; θ(t)), the spectrum can be considered to be decomposed into harmonic-structure tone models, and w(t)(F) can be interpreted as the relative predominance of the tone model with an F0 of F at time t. We can therefore calculate the NIEP p(X; t, f) as the weight w(t)(f), which can be estimated using the Expectation-Maximization (EM) algorithm 10). In the current implementation, we use the tone model given by

    p(x|F) = α Σ_{h=1}^{N} c(h) G(x; F + 1200 log2 h, W),
    G(x; m, σ) = (1 / √(2πσ²)) exp(−(x − m)² / (2σ²)),

where α is a normalizing factor, N = 16, W = 17 cent, and c(h) = G(h; 1, 5.5). This tone model was also used in the earliest version of the PreFEst 11).
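The following is a minimal single-frame sketch of the tone model and of estimating the weights w(t)(F) by the EM algorithm. It is not Goto's PreFEst implementation (which also includes a multi-resolution front-end and prior distributions); the helper names, the log-frequency grid, and the discretized normalization are our assumptions, while the constants follow the values given above.

    import numpy as np

    def gauss(x, m, sigma):
        return np.exp(-(x - m) ** 2 / (2.0 * sigma ** 2)) / np.sqrt(2.0 * np.pi * sigma ** 2)

    def tone_model(x_cent, f0_cent, n_harmonics=16, width=17.0):
        """Harmonic-structure tone model p(x|F) on a log-frequency axis (in cents)."""
        spec = np.zeros_like(x_cent, dtype=float)
        for h in range(1, n_harmonics + 1):
            c_h = gauss(h, 1.0, 5.5)                          # c(h) = G(h; 1, 5.5)
            spec += c_h * gauss(x_cent, f0_cent + 1200.0 * np.log2(h), width)
        return spec / spec.sum()                              # alpha: normalize to a pdf

    def prefest_core(observed, f0_grid, x_cent, n_iter=50):
        """EM estimation of the tone-model weights for one frame.

        observed : nonnegative power spectrum on x_cent, normalized to sum to 1.
        Returns w[F], interpretable as the NIEP p(X; t, f) at this frame.
        """
        models = np.stack([tone_model(x_cent, f0) for f0 in f0_grid])   # (F, X)
        w = np.full(len(f0_grid), 1.0 / len(f0_grid))
        for _ in range(n_iter):
            mix = w @ models + 1e-12                          # p(x; theta) on the grid
            resp = (w[:, None] * models) / mix                # E-step: responsibilities
            w = (resp * observed).sum(axis=1)                 # M-step: updated weights
            w /= w.sum()
        return w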

3.3 Conditional Instrument Existence Probability

The following steps are performed for every frequency f.

3.3.1 Harmonic Structure Extraction

The temporal trajectory H(t, f) of the harmonic structure (10 harmonics) whose F0 is f is extracted.

3.3.2 Feature Extraction

It is important to design features that will effectively recognize musical instruments. Although mel-frequency cepstrum coefficients (MFCCs) and Delta MFCCs are commonly used in the field of speech recognition, we should design features that are optimized for musical instrument sounds, because musical instrument sounds have complicated temporal variations (e.g., amplitude and frequency modulations). We therefore designed the 28 features listed in Table 1 based on our previous studies 9).


Table 1 Overview of 28 features.

Spectral features
  1       Spectral centroid
  2       Relative power of fundamental component
  3–10    Relative cumulative power from fundamental to i-th components (i = 2, 3, ..., 9)
  11      Relative power in odd and even components
  12–20   Number of components having a duration that is p% longer than the longest duration (p = 10, 20, ..., 90)

Temporal features
  21      Gradient of straight line approximating power envelope
  22–24   Temporal mean of differentials of power envelope from t to t + iT/3 (i = 1, ..., 3)

Modulation features
  25, 26  Amplitude and frequency of AM
  27, 28  Amplitude and frequency of FM

For every time t (every 10 ms in the implementation), we first excerpt a T-length bit of the harmonic-structure trajectory Ht(τ, f) (t ≤ τ < t + T) from the entire trajectory H(t, f) and then extract a feature vector x(t, f) consisting of the 28 features from Ht(τ, f). The dimensionality is then reduced to 12 dimensions using principal component analysis with a proportion value of 95%. T is 500 ms in the current implementation.
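The windowing and dimensionality-reduction steps can be sketched as follows. The 28 feature extractors of Table 1 are not reproduced here, so extract_28_features is a placeholder, and the use of scikit-learn's PCA (a float n_components meaning the retained proportion of variance) is our choice, not necessarily the authors'.

    import numpy as np
    from sklearn.decomposition import PCA

    def feature_sequence(trajectory, extract_28_features, win_frames=50, hop_frames=1):
        """Slide a T = 500 ms window (50 frames at the 10 ms hop) over the
        harmonic-structure trajectory of one F0 and extract one 28-dimensional
        feature vector per position."""
        feats = [extract_28_features(trajectory[t:t + win_frames])
                 for t in range(0, len(trajectory) - win_frames + 1, hop_frames)]
        return np.asarray(feats)

    def reduce_dimensions(features28):
        """PCA keeping a 95% proportion of the variance (about 12 dimensions here)."""
        return PCA(n_components=0.95).fit_transform(features28)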

3.3.3 Probability Calculation

We train L-to-R HMMs, each consisting of 15 states, for the target instruments ω1, ..., ωm, and then basically consider the time series of feature vectors, x(t, f), to be generated from a Markov chain of these HMMs. (We used more states than the three typically used in speech recognition studies because the notes of musical instruments usually have longer durations than phonemes.) Then, the CIEP p(ωi|X; t, f) is calculated as

    p(Mi|x(t, f)) = p(x(t, f)|Mi) p(Mi) / Σ_{i=1}^{m} p(x(t, f)|Mi) p(Mi),

where Mi is the HMM corresponding to the instrument ωi, p(x(t, f)|Mi) is trained from data prepared in advance, and p(Mi) is the a priori probability.

In the above formulation, p(ωi|X; t, f) for some instruments may become greater than zero even if no instruments are played. Theoretically, this does not matter because p(X; t, f) becomes zero in such cases. In practice, however, p(X; t, f) may not be zero, especially when a certain instrument is played at an F0 that is an integer multiple or factor of f. To avoid this, we prepare an HMM, M0, trained with feature vectors extracted from silent signals (note that some instruments may be played at non-target F0s) and consider x(t, f) to be generated from a Markov chain of the m + 1 HMMs (M0, M1, ..., Mm). The CIEP is therefore calculated as

    p(Mi|x(t, f)) = p(x(t, f)|Mi) p(Mi) / Σ_{i=0}^{m} p(x(t, f)|Mi) p(Mi),

where we use p(Mi) = 1/(m + 1).
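Assuming the per-model likelihoods p(x(t, f)|Mi) are already available (e.g., as log scores from the trained HMMs), the posterior above reduces to a normalized product with the uniform prior; the following sketch is ours, not part of the paper.

    import numpy as np

    def ciep_posterior(loglik, prior=None):
        """p(M_i | x(t, f)) from log-likelihoods of the m + 1 models (index 0 = silence)."""
        loglik = np.asarray(loglik, dtype=float)
        if prior is None:
            prior = np.full(loglik.shape, 1.0 / loglik.size)   # p(M_i) = 1 / (m + 1)
        log_post = loglik + np.log(prior)
        log_post -= log_post.max()                             # avoid overflow in exp
        post = np.exp(log_post)
        return post / post.sum()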

3.4 Simplifying Instrograms

Although we calculate IEPs for every F0, some applications do not need such detailed results. If the instrogram is used for retrieving musical pieces that include a certain instrument's sounds, for example, IEPs for rough frequency regions (e.g., high, middle, and low) are sufficient. We therefore divide the entire frequency region into N subregions I1, ..., IN and calculate the IEP p(ωi; t, Ik) for the k-th frequency subregion Ik. Here, this is defined as p(ωi; t, ∪_{f∈Ik} f), which can be obtained by iteratively calculating the following equation because the frequency axis is practically discrete:

    p(ωi; t, f1 ∪ ··· ∪ fi ∪ fi+1) = p(ωi; t, f1 ∪ ··· ∪ fi) + p(ωi; t, fi+1) − p(ωi; t, f1 ∪ ··· ∪ fi) p(ωi; t, fi+1),

where Ik = {f1, ..., fi, fi+1, ..., fnk}.
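Because the per-bin probabilities are combined with the inclusion-exclusion rule p(A ∪ B) = p(A) + p(B) − p(A)p(B), the subregion IEP can be accumulated in a single pass, as in this small sketch (function name ours).

    def iep_for_subregion(p_bins):
        """Combine p(omega_i; t, f) over the frequency bins f1, ..., f_nk of one subregion."""
        acc = 0.0
        for p in p_bins:
            acc = acc + p - acc * p      # p(A or B) = p(A) + p(B) - p(A)p(B)
        return acc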

4. Application

Here, we discuss the application of instrograms to MPEG-7 annotation and MIR.

4.1 MPEG-7 Annotation

There are two choices for transforming instrograms to MPEG-7 annotations. First, we can simply represent IEPs as a time series of vectors. Because the MPEG-7 standard has no tag for the instrogram annotation, we added several original tags, as shown in Fig. 4. This example shows the time series of eight-dimensional IEPs for the piano (line 16) with a time resolution of 10 ms (line 6). Each dimension corresponds to a different frequency subregion, which is defined by dividing the entire range from 65.5 Hz to 1,048 Hz (line 3) by 1/2 octave (line 4).

Second, we can transform instrograms into a symbolic representation. We also added several original tags for this, as shown in Fig. 5.


 1: <AudioDescriptor
 2:     xsi:type="AudioInstrogramType"
 3:     loEdge="65.5" hiEdge="1048"
 4:     octaveResolution="1/2">
 5:   <SeriesOfVector totalNumOfSamples="5982"
 6:       vectorSize="8" hopSize="PT10N1000F">
 7:     <Raw mpeg7:dim="5982 8">
 8:       0.0 0.0 0.0 0.0 0.718 0.017 0.051 0.0
 9:       0.0 0.0 0.0 0.0 0.724 0.000 0.085 0.0
10:       0.0 0.0 0.0 0.0 0.702 0.013 0.089 0.0
11:       0.0 0.0 0.0 0.0 0.661 0.017 0.063 0.0
12:       ......
13:     </Raw>
14:   </SeriesOfVector>
15:   <SoundModel
16:       SoundModelRef="IDInstrument:Piano"/>
17: </AudioDescriptor>

Fig. 4 Excerpt of example of instrogram annotation.

 1: <MultimediaContent xsi:type="AudioType">
 2:   <Audio xsi:type="AudioSegmentType">
 3:     <MediaTime>
 4:       <MediaTimePoint>T00:00:06:850N1000
 5:       </MediaTimePoint>
 6:       <MediaDuration>PT0S200N1000
 7:       </MediaDuration>
 8:     </MediaTime>
 9:     <AudioDescriptor xsi:type="SoundSource"
10:         loEdge="92" hiEdge="130">
11:       <SoundModel
12:           SoundModelRef="IDInstrument:Piano"/>
13:     </AudioDescriptor>
14:   </Audio>

......

Fig. 5 Excerpt of example of symbolic annotation.

This example shows that an event for the piano (line 12) at a pitch between 92 and 130 Hz (line 10) occurs at 6.850 s (line 4) and continues for 0.200 s (line 6). To obtain this symbolic representation, we have to estimate the event occurrence and its duration within every frequency subregion Ik. We therefore obtain the time series of the instrument maximizing p(ωi; t, Ik) and then consider it to be an output of a Markov chain whose states are ω0, ω1, ..., ωm (Fig. 6). The transition probabilities in the chain from a state to the same state, from non-silence states to the silence state, and from the silence state to non-silence states are greater than zero, and the other probabilities are zero. After obtaining the most likely path in the chain, we can estimate the occurrence and duration of an instrument ωi from the transitions between the states ω0

and ωi. This method assumes that only one instrument is played at the same time in each frequency subregion. When multiple instruments are played in the same subregion at the same time, the most predominant instrument will be annotated.

Fig. 6 Markov chain model used in symbolic annotation. The values are transition probabilities, where pst is the probability of staying at the same state at the next time, which was experimentally determined as 1 − 10^−16.
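A sketch of this event extraction is given below. The transition structure follows Fig. 6 (self-loops with probability pst, transitions only between the silence state and instrument states); using the IEPs themselves as emission scores is our assumption, since the paper only states that the framewise maximizer is treated as the chain's output.

    import numpy as np

    def annotate_events(iep, p_stay=1.0 - 1e-16):
        """Most likely state path for one frequency subregion, turned into events.

        iep : array (T, m + 1); column 0 is p(omega_0; t, I_k), columns 1..m are
              p(omega_i; t, I_k). Returns (instrument, onset_frame, offset_frame) tuples.
        """
        T, n = iep.shape
        logA = np.full((n, n), -np.inf)                       # disallowed transitions
        np.fill_diagonal(logA, np.log(p_stay))                # stay in the same state
        logA[0, 1:] = np.log((1.0 - p_stay) / (n - 1))        # silence -> instrument
        logA[1:, 0] = np.log(1.0 - p_stay)                    # instrument -> silence
        logB = np.log(iep + 1e-12)                            # emission scores (assumption)

        delta = logB[0].copy()
        back = np.zeros((T, n), dtype=int)
        for t in range(1, T):
            scores = delta[:, None] + logA                    # scores[i, j]: from state i to j
            back[t] = scores.argmax(axis=0)
            delta = scores.max(axis=0) + logB[t]
        path = [int(delta.argmax())]
        for t in range(T - 1, 0, -1):
            path.append(int(back[t, path[-1]]))
        path.reverse()

        events, start, cur = [], None, 0
        for t, s in enumerate(path + [0]):                    # sentinel silence at the end
            if s != cur:
                if cur != 0:
                    events.append((cur, start, t))            # a segment of instrument cur
                start = t if s != 0 else None
                cur = s
        return events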

4.2 Music Information Retrieval based on Instrumentation Similarity

One of the advantages of the instrogram, which is a non-symbolic representation, is that it provides a new instrumentation-based similarity measure. The similarity between two instrograms enables MIR based on instrumentation similarity. As we pointed out in the Introduction, this key technology is important for automatic playlist generation and content-based music recommendation. Here, instead of calculating similarity, we calculate the distance (dissimilarity) between instrograms by using DTW as follows:

(1) A vector pt for every time t is obtained

by concatenating the IEPs of all instruments:

    pt = (p(ω1; t, I1), p(ω1; t, I2), ..., p(ωm; t, IN))′,

where ′ is the transposition operator.

(2) The distance between two vectors, p and q, is defined as the cosine distance:

    dist(p, q) = 1 − (p, q) / (||p|| ||q||),

where (p, q) = p′Rq and ||p|| = √(p, p). R = (rij) is a positive definite symmetric matrix. By setting rij with respect to related elements, e.g., violin vs. viola, to a value greater than zero, one can consider the similarity between such related instruments. When R is the unit matrix, (p, q) and ||p|| are equivalent to the standard inner product and norm.

(3) The distance (dissimilarity) between the sequences pt and qt is calculated by applying DTW with the above-mentioned distance measure, as sketched in the code below.
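A minimal sketch of steps (1)–(3), assuming the IEP sequences have already been concatenated into matrices of shape (time, m·N); R defaults to the unit matrix, and the plain symmetric DTW recursion is our choice since the paper does not specify the step pattern.

    import numpy as np

    def weighted_cosine_dist(p, q, R):
        """dist(p, q) = 1 - (p, q) / (||p|| ||q||) with (p, q) = p' R q."""
        ip = p @ R @ q
        return 1.0 - ip / (np.sqrt(p @ R @ p) * np.sqrt(q @ R @ q) + 1e-12)

    def instrogram_dtw(P, Q, R=None):
        """DTW dissimilarity between two IEP sequences P (T1, d) and Q (T2, d)."""
        T1, T2, d = len(P), len(Q), P.shape[1]
        R = np.eye(d) if R is None else R
        D = np.full((T1 + 1, T2 + 1), np.inf)
        D[0, 0] = 0.0
        for i in range(1, T1 + 1):
            for j in range(1, T2 + 1):
                c = weighted_cosine_dist(P[i - 1], Q[j - 1], R)
                D[i, j] = c + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
        return D[T1, T2]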

Timbral similarity was also used in previous MIR-related studies 12),13), where it was calculated on the basis of spectral features, such as MFCCs, directly extracted from


(a) FL–VN–PF (b) VN–CL–PF (c) PF–PF–PF

(d) PF–CL–PF (e) PF–VN–PF (f) VN–VN–PF

Fig. 7 Results of calculating instrograms from "Auld Lang Syne" with six different instrumentations. "FL–VN–PF" means that the treble, middle, and bass parts are played on flute, violin, and piano, respectively.

complex sound mixtures. Such features sometimes do not clearly reflect actual instrumentation, as will be implied in the next section, because they are influenced not only by instrument timbres but also by arrangements, including the voicing of chords. On the other hand, because instrograms directly represent instrumentation, they facilitate an appropriate calculation of the similarity of instrumentation. Moreover, instrograms have the following advantages:

Intuitiveness: The musical meaning is intuitively clear.

Controllability: By appropriately setting R, users can ignore the differences between pitch regions within the same instrument and/or the differences between instruments within the same instrument family.

5. Experiments

We conducted experiments on obtaining instrograms and symbolic annotations for both audio data generated on a computer and recordings of real performances. In addition, we tested the calculation of the similarity between instrograms for the real performances.

5.1 Use of Generated Audio Data

We first conducted experiments on obtaining instrograms from audio signals of the trio music "Auld Lang Syne" used by Kashino, et al. 3). The audio signals were generated by mixing audio data from RWC-MDB-I-2001 14) (Variation No. 1) according to a standard MIDI file (SMF) that we input using a MIDI sequencer based on Kashino's score. The target instruments were the piano (PF), violin (VN), clarinet (CL), and flute (FL). The training data for these instruments were taken from the audio data in RWC-MDB-I-2001 with Variation Nos. 2 and 3. The time resolution was 10 ms, and the frequency resolution was every 100 cent. The width of each frequency subregion for the simplification was 600 cent. We used HTK 3.0 for the HMMs.

The results are shown in Fig. 7. When we compare (a) and (b), (a) has high IEPs for the flute in high-frequency regions while (b) has very low (almost zero) IEPs. In contrast, (a) has very low (almost zero) IEPs for the clarinet and (b) has high IEPs. Also, (d) has high IEPs for the clarinet and almost zero IEPs for the violin, whereas (e) has high IEPs for the violin and almost zero IEPs for the clarinet. In the case of (c), only the IEPs for the piano are sufficiently high. Although both (e) and (f) are played on the piano and violin, the IEPs for the violin in the highest frequency region are


different. This correctly reflects the difference between the actual instrumentations.

Based on the instrograms obtained above, we conducted experiments on symbolic annotation using the method described in Section 4.1. We first prepared ground truth (correct data) from the SMF used to generate the audio signals and then evaluated the results based on the recall rate R and precision rate P given by

    R = Σ_{i=1}^{m} Σ_{k=1}^{N} (# frames correctly annotated as ωi at Ik) / Σ_{i=1}^{m} Σ_{k=1}^{N} (# frames that should be annotated as ωi at Ik),

    P = Σ_{i=1}^{m} Σ_{k=1}^{N} (# frames correctly annotated as ωi at Ik) / Σ_{i=1}^{m} Σ_{k=1}^{N} (# frames annotated as ωi at Ik).

The results are shown in Table 2. We achieved a precision rate of 78.7% on average. Although the recall rates were not high (14–38%), we consider the precision rates to be more important than the recall rates for MIR; a system can use recognition results even if some frames or frequency subregions are missing, whereas false results have a negative influence on MIR.

We also evaluated symbolic annotation after merging all the frequency subregions; in other words, we ignored the differences between frequency subregions. This is because instrument annotation is useful for MIR even without F0 information; for example, a task such as searching for piano solo pieces can be achieved without F0 information. The recall rate R′ and precision rate P′ for this evaluation are given by

    R′ = Σ_i (# frames correctly annotated as ωi) / Σ_i (# frames that should be annotated as ωi),

    P′ = Σ_i (# frames correctly annotated as ωi) / Σ_i (# frames annotated as ωi).

The results are listed in Table 3. The average precision rate was 87.5%, and the maximum was 95.4% for FL–VN–PF. The precision rates for all pieces were over 80%, while the recall rates were approximately between 30 and 60%.


Table 2 Results of symbolic annotation for "Auld Lang Syne."

            Recall   Precision
FL–CL–PF    28.7%    63.4%
FL–PF–PF    38.5%    89.4%
FL–VN–PF    37.2%    89.5%
PF–CL–PF    22.2%    79.3%
PF–PF–PF    26.0%    93.5%
PF–VN–PF    24.2%    76.6%
VN–CL–PF    21.4%    63.6%
VN–PF–PF    14.3%    76.1%
VN–VN–PF    30.2%    76.9%
Average     27.0%    78.7%

Table 3 Results of symbolic annotation for "Auld Lang Syne" (all frequency subregions merged).

            Recall   Precision
FL–CL–PF    36.0%    80.3%
FL–PF–PF    56.8%    87.6%
FL–VN–PF    44.5%    95.4%
PF–CL–PF    40.5%    84.4%
PF–PF–PF    62.2%    91.4%
PF–VN–PF    40.5%    88.1%
VN–CL–PF    29.2%    87.6%
VN–PF–PF    34.9%    86.7%
VN–VN–PF    40.8%    85.3%
Average     42.8%    87.5%

Table 4 Musical pieces used and their instrumentations.

Classical   (i)    No. 12, 14, 21, 38   Strings
            (ii)   No. 19, 40           Piano + Strings
            (iii)  No. 43               Piano + Flute
Jazz        (iv)   No. 1, 2, 3          Piano solo

5.2 Use of Real Performances

We next conducted experiments on obtaining instrograms from recordings of real performances of classical and jazz music taken from the RWC Music Database 15). The instrumentation of all pieces is listed in Table 4. We used only the first one-minute signal of each piece. The experimental conditions were basically the same as those in Section 5.1. Because the target instruments were the piano, violin, clarinet, and flute, the IEPs for the violin should also be high when string instruments other than the violin are played, and the IEPs for the clarinet should always be low. The training data were taken from both RWC-MDB-I-2001 14) and NTTMSA-P1, a non-public musical sound database consisting of isolated monophonic tones played by two different individuals for each instrument; every semitone over the pitch range is played with three different intensities for each instrument.

The results, shown in Fig. 8, show that (a) and (b) have high IEPs for the violin while



(a) RM-C No. 12 (Str.) (b) RM-C No. 14 (Str.) (c) RM-C No. 40 (Pf.+Str.)

(d) RM-C No. 43 (Pf.+Fl.) (e) RM-J No. 1 (Pf.) (f) RM-J No. 2 (Pf.)

Fig. 8 Results of calculating instrograms from audio signals of real performances. RM-C stands for RWC-MDB-C-2001 and RM-J stands for RWC-MDB-J-2001.

(e) and (f) have high IEPs for the piano. For (c), the IEPs for the violin increase after 10 sec, whereas those for the piano are initially high. This reflects the actual performances of these instruments. When (d) is compared to (e) and (f), the former has slightly higher IEPs for the flute than the latter, although the difference is unclear. In general, the IEPs are not as clear as those for signals generated by copy-and-pasting waveforms of RWC-MDB-I-2001. This is because the acoustic characteristics of real performances have greater variety. This could be improved by adding appropriate training data.

We also evaluated the symbolic annotation for these real-performance recordings. The evaluation was conducted only for the case in which all frequency subregions were merged, because it is difficult to manually prepare a reliable ground truth for each frequency subregion. (Although the SMF corresponding to each piece is available in the RWC Music Database, it cannot be used because the SMF and the audio signal are not synchronized.) The results are listed in Table 5. The average precision rate was 69.4% and the maximum was 84.3%, which were lower than those for synthesized music. This would also be because of the great variety in the acoustic

Table 5 Results of symbolic annotation for real recordings (all frequency subregions merged).

               Recall   Precision
RM-C No. 12    78.0%    63.4%
RM-C No. 14    76.0%    74.0%
RM-C No. 19    45.1%    65.6%
RM-C No. 21    89.9%    70.0%
RM-C No. 38    65.1%    64.0%
RM-C No. 40    50.8%    71.5%
RM-C No. 43    49.7%    84.3%
RM-J No. 1     62.1%    72.0%
RM-J No. 2     75.6%    69.3%
RM-J No. 3     45.9%    59.7%
Average        63.8%    69.4%

characteristics of real performances. The recall rates, in contrast, were higher than those for synthesized music because the same instrument was often played simultaneously over multiple frequency subregions, in which case the instrument was regarded as correctly recognized if it was recognized in any of these subregions.

5.3 Similarity Calculation

We tested the calculation of the dissimilarities between instrograms. The results, listed in Table 6 (a), can be summarized as follows (the summary list follows Table 6):


Table 6 Dissimilarities in instrumentation between musical pieces.
(Group (i): C12, C14, C21, C38; Group (ii): C19, C40; Group (iii): C43; Group (iv): J01, J02, J03.)

(a) Using IEPs (instrograms)

        C12    C14    C21    C38    C19    C40    C43    J01    J02    J03    3-best-similarity pieces
C12     0                                                                      C21, C14, C38
C14     6429   0                                                               C21, C12, C38
C21     5756   5734   0                                                        C14, C12, C38
C38     7073   6553   6411   0                                                 C21, C14, C38
C19     7320   8181   7274   7993   0                                          C21, C12, C38
C40     8650   8353   8430   8290   8430   0                                   J02, J01, C43
C43     8910   9635   9495   9729   8148   8235   0                            J01, J02, J03
J01     9711   10226  10252  10324  8305   8214   6934   0                     J02, J03, C43
J02     9856   10125  10033  10610  8228   8139   7216   6397   0              J01, C43, J03
J03     9134   9136   8894   9376   8058   8327   7480   6911   7223   0       J01, J02, C43

(b) Using MFCCs

        C12    C14    C21    C38    C19    C40    C43    J01    J02    J03    3-best-similarity pieces
C12     0                                                                      C21, C40, J02
C14     17733  0                                                               C43, C12, J02
C21     17194  18134  0                                                        C12, J01, J02
C38     18500  18426  18061  0                                                 J01, J02, C21
C19     17510  18759  18222  19009  0                                          J02, C12, J03
C40     17417  19011  18189  19099  18100  0                                   C12, J02, J01
C43     18338  17459  17728  18098  18746  18456  0                            J01, C14, J02
J01     17657  17791  17284  17834  18133  17983  16762  0                     J02, C43, J03
J02     17484  17776  17359  18009  17415  17524  17585  15870  0              J01, J03, C21
J03     17799  18063  17591  18135  17814  18038  17792  16828  16987  0       J01, J02, C21

Note: "C" and "12" of "C12", for example, represent a database (Classical / Jazz) and a piece number, respectively.

• The dissimilarities within each group were generally less than 7,000 (except Group (ii)).
• Those between Groups (i) (played on strings) and (iv) (piano) were generally greater than 9,000, and some were greater than 10,000.
• Those between Groups (i) and (iii) (piano+flute) were also around 9,000.
• Those between Groups (i) and (ii) (piano+strings), (ii) and (iii), and (ii) and (iv) were around 8,000. As one instrument is commonly used in these pairs, these dissimilarities were reasonable.
• Those between Groups (iii) and (iv) were around 7,000. Because the difference between these groups is only the presence of the flute, these were also reasonable.

For comparison, Table 6 (b) lists the results obtained using MFCCs. The 12-dimensional MFCCs were extracted every 10 ms with a 25-ms Hamming window. No Delta MFCCs were used. After the MFCCs were extracted, the dissimilarity was calculated using the method described in Section 4.2, where pt was a sequence of 12-dimensional MFCC vectors instead of IEP vectors. Comparing the results of the two methods, we can see the following differences:

• The dissimilarities within Group (i) and the dissimilarities between Group (i) and the others differed more for IEPs than for MFCCs. In fact, all the three-best-similarity pieces for the pieces in Group (i) belonged to the same group, i.e., (i), for IEPs, while those for MFCCs contained pieces outside Group (i).
• None of the three-best-similarity pieces for the four pieces without strings (Groups (iii) and (iv)) contained strings for IEPs, whereas those for MFCCs contained pieces with strings (C14, C21).

We also developed a simple prototype system that searches for pieces that have similar instrumentation to a piece specified by the user. The demonstration of our MIR system and other materials will be available at: http://winnie.kuis.kyoto-u.ac.jp/˜kitahara/instrogram/IPSJ07/

6. Discussion

This study makes three major contributions to instrument recognition and MIR.

The first is the formulation of instrument recognition as the calculation of NIEPs and CIEPs. Because the calculation of NIEPs includes a process that can be considered an alternative to the estimation of onset times and F0s, this formulation has made it possible to omit their explicit estimation, which is difficult for polyphonic music. Based on similar motivations, Vincent, et al. 7) and Essid, et


al. 8) proposed new instrument recognition techniques. Vincent, et al.'s technique involves both transcription and instrument identification in a single optimization procedure. This technique is based on a reasonable formulation and is probably effective, but it has only been tested on solo and duo excerpts. Essid, et al.'s technique identifies the instrumentation, instead of the instrument for each part, from a pre-designed possible-instrumentation list. This technique is based on the standpoint that music usually has one of several typical instrumentations. They reported successful experimental results, but identifying instrumentations other than those prepared is impossible. Our instrogram technique, in contrast, has made it possible to recognize instrumentation without making any assumptions about instrumentation, for audio data including synthesized music and real performances that have various instrumentations.

The second contribution is the establishment of a graphical representation of instrumentation. Previous studies on music visualization were generally based on MIDI-level symbolic information 16),17). Although spectrograms and specmurts 18) are useful for the visualization of audio signals, it is difficult to recognize instrumentation from them. Our instrogram representation provides visual information on instrumentation, which should be useful for generating music thumbnails and other visualizations such as animations for entertainment.

The third contribution is the achievement of MIR based on instrumentation similarity. Although both timbral similarity calculation and instrument recognition have been actively investigated, no attempts had been made to calculate instrumentation similarity on the basis of instrument recognition techniques, because previous instrument recognition studies aimed to determine the instruments that are played in given signals. The instrogram, which represents instrumentation as a set of continuous values, is an effective approach to designing a continuous similarity measure.

7. Conclusion

We described a new instrogram representation obtained by using a new musical instrument recognition technique that explicitly uses neither onset detection nor F0 estimation. Whereas most previous studies first estimated the onset time and F0 of each note and then identified the instrument for each note, our

technique calculates the instrument existence probability for each target instrument at each point on the time-frequency plane. This non-note-based approach made it possible to avoid the adverse influences caused by errors in onset detection and F0 estimation.

In addition, we presented methods for applying the instrogram technique to MPEG-7 annotation and to MIR based on instrumentation similarity. Although the note-based outputs of previous instrument recognition methods were difficult to apply to continuous similarity calculation, our instrogram representation provided a measure for similarities in instrumentation and achieved MIR based on this measure.

The instrogram is related to Goto's study on music understanding based on the idea that people understand music without mentally representing it as a score 19). Goto claims that good music descriptors should be musically intuitive, fundamental to a professional method of understanding music, and useful for various applications. We believe that the instrogram satisfies these requirements and therefore intend to apply it to professional score-based music understanding as well as to various applications including MIR.

Acknowledgments  This research was supported in part by the Ministry of Education, Culture, Sports, Science and Technology (MEXT), Grant-in-Aid for Scientific Research and Informatics Research Center for Development of Knowledge Society Infrastructure (COE program of MEXT, Japan). We used the RWC Music Database and NTTMSA-P1. We would like to thank everyone who has contributed to building these databases and wish to thank NTT Communication Science Laboratories for giving us permission to use NTTMSA-P1. The experiment presented in the appendix was conducted by Mr. Katsutoshi Itoyama. We appreciate his cooperation. We would also like to thank Dr. Shinji Watanabe (NTT) for his valuable comments.

References

1) Martin, K.D.: Sound-Source Recognition: A Theory and Computational Model, PhD Thesis, MIT (1999).

2) Kashino, K., Nakadai, K., Kinoshita, T. and Tanaka, H.: Application of the Bayesian Probability Network to Music Scene Analysis, Computational Auditory Scene Analysis, Rosenthal, D.F. and Okuno, H.G. (Eds.), pp.115–137, Lawrence Erlbaum Associates (1998).


3) Kashino, K. and Murase, H.: A Sound Source Identification System for Ensemble Music based on Template Adaptation and Music Stream Extraction, Speech Comm., Vol.27, pp.337–349 (1999).

4) Kinoshita, T., Sakai, S. and Tanaka, H.: Musical Sound Source Identification based on Frequency Component Adaptation, Proc. IJCAI CASA Workshop, pp.18–24 (1999).

5) Eggink, J. and Brown, G.J.: Application of Missing Feature Theory to the Recognition of Musical Instruments in Polyphonic Audio, Proc. ISMIR (2003).

6) Eggink, J. and Brown, G.J.: Extracting Melody Lines from Complex Audio, Proc. ISMIR, pp.84–91 (2004).

7) Vincent, E. and Rodet, X.: Instrument Identification in Solo and Ensemble Music using Independent Subspace Analysis, Proc. ISMIR, pp.576–581 (2004).

8) Essid, S., Richard, G. and David, B.: Instrument Recognition in Polyphonic Music, Proc. ICASSP, Vol.III, pp.245–248 (2005).

9) Kitahara, T., Goto, M., Komatani, K., Ogata, T. and Okuno, H.G.: Instrument Identification in Polyphonic Music: Feature Weighting with Mixed Sounds, Pitch-dependent Timbre Modeling, and Use of Musical Context, Proc. ISMIR, pp.558–563 (2005).

10) Goto, M.: A Real-time Music-scene-description System: Predominant-F0 Estimation for Detecting Melody and Bass Lines in Real-world Audio Signals, Speech Comm., Vol.43, No.4, pp.311–329 (2004).

11) Goto, M.: A Robust Predominant-F0 Estimation Method for Real-time Detection of Melody and Bass Lines in CD Recordings, Proc. ICASSP, Vol.II, pp.757–760 (2000).

12) Tzanetakis, G. and Cook, P.: Musical Genre Classification of Audio Signals, IEEE Trans. Speech Audio Process., Vol.10, No.5, pp.293–302 (2002).

13) Aucouturier, J.-J. and Pachet, F.: Music Similarity Measure: What's the Use?, Proc. ISMIR, pp.157–163 (2002).

14) Goto, M., Hashiguchi, H., Nishimura, T. and Oka, R.: RWC Music Database: Music Genre Database and Musical Instrument Sound Database, Proc. ISMIR, pp.229–230 (2003).

15) Goto, M., Hashiguchi, H., Nishimura, T. and Oka, R.: RWC Music Database: Popular, Classical, and Jazz Music Databases, Proc. ISMIR, pp.287–288 (2002).

16) Hiraga, R.: A Look of Performance, IEEE Visualization (2002).

17) Hiraga, R., Miyazaki, R. and Fujishiro, I.: Performance Visualization—A New Challenge to Music through Visualization, ACM Multimedia (2002).

18) Sagayama, S., Takahashi, K., Kameoka, H. and Nishimoto, T.: Specmurt Anasylis: A Piano-roll-visualization of Polyphonic Music by Deconvolution of Log-frequency Spectrum, Proc. SAPA (2004).

19) Goto, M.: Music Scene Description Project: Toward Audio-based Real-time Music Understanding, Proc. ISMIR, pp.231–232 (2003).

20) Kameoka, H., Nishimoto, T. and Sagayama, S.: Harmonic-temporal-structured Clustering via Deterministic Annealing EM Algorithm for Audio Feature Extraction, Proc. ISMIR, pp.115–122 (2005).

Appendix

A.1 Experiment on Notewise Instrument Recognition

This Appendix presents the results of an experiment on notewise instrument recognition. The problem that we deal with here is to detect a note sequence played on a specified instrument. We first estimate the onset time and F0 of every note using harmonic-temporal-structured clustering (HTC) 20). For every note, we then calculate the likelihood that the note would be played on the specified instrument using our previous instrument identification method 9). Finally, we select the note that has the maximal likelihood at every time. Note that we assume that multiple notes are not simultaneously played on the specified instrument. We used "Auld Lang Syne" played on the piano, violin, and flute as a test sample. This was synthesized by mixing audio data from RWC-MDB-I-2001 14), similarly to the experiment reported in Section 5.1. A detected note is judged to be correct only if it is actually played on the specified instrument at the correct F0 (note number) and the error of the estimated onset time is less than e [s]. Recognition was evaluated through the following:

    R = (# correctly detected notes) / (# notes that should be detected),

    P = (# correctly detected notes) / (# detected notes).

The results are shown in Table 7. Recognition, especially for the violin and flute, was not sufficiently accurate. Although the results of this experiment cannot be directly compared with


Table 7 Results of experiment on notewise approach.

       e = 0.2               e = 0.4               e = 0.6
       PF    VN    FL        PF    VN    FL        PF    VN    FL
R      59%   36%   31%       59%   42%   34%       60%   47%   48%
P      86%   26%   17%       86%   24%   19%       87%   27%   27%

those obtained using our approach because they were evaluated in different ways, we can see that the notewise approach is not robust.

(Received May 8, 2006) (Accepted October 3, 2006)

(The online version of this article can be found in the IPSJ Digital Courier, Vol.3, pp.1–13.)

Tetsuro Kitahara received his B.S. from the Tokyo University of Science in 2002 and his M.S. from Kyoto University in 2004. He is currently a Ph.D. student at the Graduate School of Informatics of Kyoto University. He has been a Research Fellow of the Japan Society for the Promotion of Science since 2005. His research interests include music informatics. He has received several awards including the TELECOM System Technology Award for Student and the IPSJ 67th National Convention Best Paper Award for Young Researcher.

Masataka Goto received his Doctor of Engineering degree in electronics, information, and communication engineering from Waseda University, Japan, in 1998. He is currently a Senior Research Scientist of the National Institute of Advanced Industrial Science and Technology (AIST). He served concurrently as a Researcher in Precursory Research for Embryonic Science and Technology (PRESTO), Japan Science and Technology Corporation (JST) from 2000 to 2003, and has been an Associate Professor of the Department of Intelligent Interaction Technologies, Graduate School of Systems and Information Engineering, University of Tsukuba, since 2005. He has received 18 awards, including the IPSJ Best Paper Award, IPSJ Yamashita SIG Research Awards, and the Interaction 2003 Best Paper Award.

Kazunori Komatani received his B.E., M.S. and Ph.D. degrees from Kyoto University, Japan, in 1998, 2000, and 2002. He is currently an Assistant Professor at the Graduate School of Informatics of Kyoto University, Japan. He received the 2002 FIT Young Researcher Award and the 2004 IPSJ Yamashita SIG Research Award, both from the Information Processing Society of Japan. His research interests include spoken dialogue systems.

Tetsuya Ogata received his B.E., M.S. and Ph.D. degrees from Waseda University, Japan, in 1993, 1995, and 2000. He is currently an Associate Professor at the Graduate School of Informatics of Kyoto University, Japan. He served concurrently as a Visiting Associate Professor of the Humanoid Robotics Institute of Waseda University, and a Visiting Scientist of the Brain Science Institute of RIKEN. His research interests include multi-modal active sensing and robot imitation.

Hiroshi G. Okuno received his B.A. and Ph.D. from the University of Tokyo in 1972 and 1996. He worked for NTT, JST and the Tokyo University of Science. He is currently a professor in the Graduate School of Informatics at Kyoto University. He is currently engaged in computational auditory scene analysis, music scene analysis and robot audition. He has received various awards including the 1990 Best Paper Award of JSAI, the Best Paper Awards of IEA/AIE-2001 and 2005, and the IEEE/RSJ Nakamura Award for IROS-2001 Best Paper Nomination Finalist. He was also awarded the 2003 Funai Information Science Achievement Award. He edited with David Rosenthal "Computational Auditory Scene Analysis" (LEA, 1998) and with Taiichi Yuasa "Advanced Lisp Technology" (T&F, 2002).