

MIR/MDL Evaluation: Making Progress

J. Stephen Downie
Graduate School of Library and Information Science

University of Illinois at Urbana-Champaign
+1-(217) 265-5018

[email protected]

1. INTRODUCTION
The papers in Part I of this 3rd Edition of "The MIR/MDL Evaluation Project White Paper Collection" represent the first round of formalized MIR/MDL community input on the topic of MIR/MDL evaluation. Each was presented at the Workshop on the Creation of Standardized Test Collections, Tasks and Metrics for Music Digital Library (MDL) and Music Information Retrieval (MIR) Evaluation, held 18 July 2002 at the ACM/IEEE Joint Conference on Digital Libraries. The Part I papers were originally published as "The MIR/MDL Evaluation Project White Paper Collection, Edition #1". To these first-edition papers were added the papers which follow in Part II; the aggregation of the Part I and Part II papers together represents the second edition of the "White Paper Collection". The Part II papers are those presented at the Panel on Music Information Retrieval Evaluation Frameworks, held 17 October 2002 at the 3rd International Conference on Music Information Retrieval (ISMIR 2002), Paris, France. The Part III papers, to which this is the introduction, are those to be presented at the Workshop on the Evaluation of Music Information Retrieval (MIR) Systems, 1 August 2003, Toronto, Canada, as part of the 26th Annual International SIGIR Conference (SIGIR 2003).

It is the purpose of this paper to provide the reader with an updated summary of the significant progress being made toward establishing meaningful MIR/MDL evaluation frameworks.

2. Synopsis of Key Findings and Recommendations from the "MIR/MDL Evaluation Frameworks Project" i

Summarized below are the principal findings and recommendations that flow from my analyses of MIR/MDL community input. There is clearly a pressing need for the following:

1. Creation of a large-scale, common corpus of music materials that is accessible to the MIR/MDL community.

2. Creation of TREC-like evaluation mechanisms, tailored to the needs of MIR/MDL developers.

3. Explicit capture and analysis of a wide variety of real-world music queries upon which to base the creation of the query records to be used in testing.

4. Development of formal requirements for the necessary elements (and their constituent data types) to be used in the query records.

5. Validation of the "reasonable person" relevance judgement assumption through inter-rater reliability studies.

6. Continued acquisition of more music information (audio, symbolic, and metadata), with a special effort to acquire "top hits" popular music and more non-Western musics to make real-world, real-time user studies a possibility. The acquisition of non-Western musics is particularly important as there is a strongly perceived bias toward Western music within current MIR research.

3. Progressing Toward TREC-like Evaluations
Unlike the text IR community, MIR research is still lacking a set of common collections and evaluation techniques with which the teams can scientifically standardize the comparisons of their systems (i.e., there is no MIR TREC). Gaining access to large collections of audio, symbolic and metadata information for use in standardized testing has been hindered by copyright and technological constraints. To help overcome these constraints, the University of Illinois at Urbana-Champaign has begun work on the creation of the International Music Information Retrieval System Evaluation Laboratory (IMIRSEL). Supported by the National Science Foundation, the Andrew W. Mellon Foundation, and the National Center for Supercomputing Applications (NCSA), IMIRSEL has been established to provide bona fide research organizations access to a unique collection of music materials in multiple formats (e.g., audio, symbolic, and metadata). This collection, physically housed on NCSA's supercomputing servers, will be accessible via the set of secure Virtual Research Laboratories (VRLs) that we are currently developing. These VRLs will afford MIR research projects the means to interact with a common corpus of copyrighted music materials on a scale hitherto unknown to the MIR research community (i.e., multi-terabytes). IMIRSEL's first-phase collection will consist of: 1) some 30,000+ digital audio recordings from the HNH music catalogs (owners of the Naxos and Marco Polo recording labels) representing such genres as classical, jazz and Asian folk; and, 2) the expansive metadata collection being provided by All Media Guide (owners of Allmusic.com). The IMIRSEL systems should be available for initial research use at the end of Fall Term 2003. ii

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page.


4. The Andrew W. Mellon Foundation Proposal
Over the Spring of 2003, the Andrew W. Mellon Foundation gave us permission to submit a large-scale grant proposal to help the MIR/MDL community establish the necessary tools and organizational structures for meaningful evaluation work. Below is a verbatim extract from the proposal submitted 22 July 2003. We expect to hear back from Mellon about the status of the proposal in early Fall 2003.

4.1 Executive Summary: Extract from Proposal
[BEGIN EXTRACT] This proposed research project is designed to enhance the important and significant work being done by the Music Information Retrieval (MIR) and Music Digital Library (MDL) research communities by providing an opportunity for these communities to realize the establishment of sorely-needed evaluation tools. This proposal builds upon the ongoing efforts being made to establish TREC-like and other comprehensive evaluation paradigms within the MIR/MDL research communities. The proposed research tasks are based upon expert opinion garnered from members of the Information Retrieval (IR), MDL and MIR communities with regard to the construction and implementation of scientifically valid evaluation frameworks. The proposed research has two complementary tracks: 1) the establishment of internationally accessible mechanisms and evaluation standards for the comprehensive evaluation of MIR and MDL systems; and, 2) the formal investigation of the human factors involved in the creation, use and evaluation of MIR and MDL systems. Based on a three-year budgeting outline, the proposed project has a four-year time-span (1 October 2003 to 30 September 2007) and is seeking approximately $390,000 in financial support from the Andrew W. Mellon Foundation. Project deliverables include:

1. The creation and refinement of secure access mechanisms that will allow the manipulation of a unique, large-scale standard corpus of music materials for the research and evaluation use of the international MIR/MDL research community.

2. The creation, refinement and dissemination of a TREC-like evaluation scenario based upon the special needs and requirements of the MIR/MDL community.

3. The creation, refinement and dissemination of a collection of standardized query documents, based upon the real-world expression of user needs.

4. The creation, refinement and dissemination of a deeper and more comprehensive understanding of the behavior of MIR/MDL systems, their uses and their users.

5. The establishment of a framework of support with which to ensure the long-term, post-funding, sustainability of the project and its materials.

The proposed project has recently been awarded a one-year (approximately $100,000) seed-funding grant from the National Science Foundation (NSF). The NSF funding is designed to facilitate the creation of the basic technological, infrastructural and organizational foundations upon which this more ambitious and broad-reaching project will be based. [END EXTRACT]

5. Next Steps
After the SIGIR 2003 Workshop meeting we will begin to form an International Advisory Board to help provide input on the running of the IMIRSEL resources. We are planning to convene another evaluation panel as part of the 4th International Conference on Music Information Retrieval (ISMIR 2003), to be held 30 October 2003 in Baltimore, MD. With luck and hard work, we hope to unveil the basic IMIRSEL infrastructure at this meeting. We are also in the process of working with an important online peer-reviewed journal board to investigate the possibility of presenting these White Papers as part of a special edition devoted to MIR/MDL evaluation issues.

6. Acknowledgements
The first three persons whom I must thank and acknowledge are Drs. Ellen Voorhees, Edie Rasmussen and Beth Logan for so ably giving of themselves as the project's keynote presenters. Their input has helped inspire both myself and the other contributors to think about the MIR/MDL evaluation problem in a new and better-informed light. Next, I would like to thank Drs. Don Waters and Suzanne Lodato and the Andrew W. Mellon Foundation for their ongoing moral and financial support of the project. Stephen Griffin and the National Science Foundation must also be thanked for both the seed-funding grant to build the International Music Information Retrieval System Evaluation Laboratory and the funding support for the SIGIR 2003 workshop. My colleagues at the Graduate School of Library and Information Science have been extremely supportive of my work and I owe them an eternal debt of gratitude. Karen Medina and Gina Lee, my Graduate Assistants, are also thanked most heartily for their work in preparing the papers for publication. All Media Guide and HNH Ltd. are thanked for their important support in helping us populate the test collection with valuable data. Mike Welge and the NCSA are thanked for their time and resource support of the project and IMIRSEL. The MIR/MDL community is thanked for its outpouring of support, especially the letters written on our behalf as we sought start-up funding. Finally, I want to thank all the authors who have contributed White Papers for us to consider.

i A more comprehensive summary of the findings and recommendations of the "MIR/MDL Evaluation Frameworks Project" is to be published as: Downie, J. Stephen. 2003. Toward the scientific evaluation of music information retrieval systems. In Proceedings of the 4th International Conference on Music Information Retrieval (ISMIR 2003), Baltimore, MD, in press.
ii See Appendix C for an abstract of the project's basic features. This abstract is to be published in the Proceedings of SIGIR 2003.


Toward Evaluation Techniques for Music Similarity

Beth Logan
HP Labs, Cambridge MA, USA
[email protected]

Daniel P.W. Ellis
Columbia University, New York NY, USA
[email protected]

Adam Berenzweig
Columbia University, New York NY, USA
[email protected]

ABSTRACT
We describe and discuss our recent work developing a database, methodology and ground truth for the evaluation of automatic techniques for music similarity. Our database consists of acoustic and textual 'Web-mined' data covering 400 popular artists. Of note is our technique of sharing acoustic features rather than raw audio to avoid copyright problems. Our evaluation methodology allows any data source to be regarded as ground truth and can highlight which measure forms the best collective ground truth. We additionally describe an evaluation methodology that is useful for data collected from people in the form of a survey about music similarity. We have successfully used our database and techniques to evaluate a number of music similarity algorithms.

1. INTRODUCTION
The ubiquity of digital compression formats is transforming the way that people store, access and acquire music. Central to these changes is a need for algorithms to automatically organize vast audio repositories. Techniques to automatically determine music similarity will be a necessary component of such systems and as such have attracted much attention in recent years [10, 9, 13, 11, 1, 8].

However, for the researcher or system builder looking to use or design similarity techniques, it is difficult to decide which is best suited for the task at hand simply by reading the literature. Few authors perform comparisons across multiple techniques, not least because there is no agreed-upon database for the community. Furthermore, even if a common database were available, it would still be a challenge to establish an associated ground truth, given the intrinsically subjective nature of music similarity. It is not immediately clear how to obtain a reference ground truth for music similarity, since it is a naturally subjective phenomenon: it can vary not only across users, but across time, according to mood and according to context. Previous work has examined finding the ground truth for such a database [8].

In this paper, we describe our recently developed methodology and database for evaluating similarity measures. Our goal is to develop three key components necessary for a healthy community of comparable music similarity research: (1) a large-scale, sharable database of features derived from real music; (2) ground truth results that best approach the ideal subjective outcomes; and (3) general, appropriate and accurate evaluation methodologies for this kind of work. Of these, the idea of a single ground truth is most problematic, since there is no particular reason to believe that similarity between two artists exists other than in the context of a particular individual's taste. Although no two music listeners will completely agree, we still think it is useful to try and capture some kind of 'average' consensus.

We have previously validated our approach by comparing a variety of acoustic and subjective similarity measures on a large amount of common data at multiple sites [3]. Although our work has focused on artist similarity, our techniques extend to song similarity given a suitable database. We hope that our work will provide a helpful example and some useful techniques for other researchers to use. Ideally, we would like to see different sites contribute to a shared, common database of Web-mined features and copyright-friendly front-end features derived from their locally-owned music, as described below.

This paper is organized as follows. First we discuss some of the different kinds of music similarity measures in order to motivate the data and techniques required for evaluation. Next we describe our evaluation database, followed by the determination of ground truth and our evaluation methodologies. Finally, we discuss the results of our recent music similarity evaluation and our conclusions.

2. MUSIC SIMILARITY MEASURES
Music similarity measures rely on one of three types of information: symbolic representations, acoustic properties, and subjective or 'cultural' information. Let us consider each of these from the perspective of their suitability for automatic systems.

Many researchers have studied the music similarity problem by analyzing symbolic representations such as MIDI music data, musical scores, etc., or by using pitch-tracking to create a score-like 'melody contour' for a set of musical recordings. String matching techniques are then used to compare the transcriptions for each song [4, 12, 10]. However, only a small subset of music has good-quality machine-readable score descriptions available, and automatic transcription becomes difficult and error-prone for anything other than monophonic music. Thus, pitch-based techniques are only applicable to single-voice music, and approaches based on MIDI or scores can only be used for music which is already in symbolic form.

Acoustic approaches analyze the music content directly and thus can be applied to any music for which one has the audio. Most techniques use data derived from the short-term frequency spectrum and/or rhythm data. Typically, these features are modeled by one of a variety of machine learning techniques, and comparisons in this domain are used to determine similarity [5, 9, 13, 11, 1, 2].
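To make the acoustic-modeling idea concrete, here is a minimal sketch (our illustration for this collection, not the authors' code) that summarizes an artist's pooled MFCC frames with a small K-means model and compares two such models with a simple symmetric nearest-centroid distance. The work discussed later in this paper compares such cluster models with the Earth-Mover's Distance; this cruder distance is only a stand-in, and the function names are ours.

```python
# Sketch: model each artist by K-means centroids over MFCC frames, then
# compare two models with a symmetric nearest-centroid distance.
# This is a simplified stand-in for the EMD-based comparison cited in the text.
import numpy as np
from sklearn.cluster import KMeans

def artist_model(mfcc_frames, n_clusters=8, seed=0):
    """mfcc_frames: (n_frames, n_mfcc) array pooled over an artist's songs."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    km.fit(mfcc_frames)
    return km.cluster_centers_

def model_distance(centroids_a, centroids_b):
    """Average distance from each centroid to its nearest counterpart, symmetrized."""
    d = np.linalg.norm(centroids_a[:, None, :] - centroids_b[None, :, :], axis=-1)
    return 0.5 * (d.min(axis=1).mean() + d.min(axis=0).mean())

# Example with random stand-in features:
rng = np.random.default_rng(0)
a = artist_model(rng.normal(size=(500, 20)))
b = artist_model(rng.normal(loc=0.5, size=(500, 20)))
print(model_distance(a, b))
```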

With the growth of the Web, techniques based on publicly-available data have emerged [7, 8, 14]. These use text analysis and collaborative filtering techniques to combine data from many individuals to determine similarity based on subjective information. Since they are based on human opinion, these approaches capture many cultural and other intangible factors that are unlikely to be obtained from audio. The disadvantage of these techniques, however, is that they are only applicable to music for which a reasonable amount of reliable Web data is available. For new or undiscovered artists, effective audio-based techniques would have a great advantage.

Given our bias toward automatic techniques applicable to actual music recordings, we will focus on the latter two approaches in this paper. We now turn to the types of data required to determine similarity in the acoustic and 'Web-mined' or subjective domains.

2.1 Data for Acoustic Similarity
Ideally, a database for evaluating acoustic similarity techniques would contain the raw audio of each song. This would enable an unlimited variety of features and models to be investigated and would additionally allow researchers to 'spot check' the results using their own judgment by listening to the pieces.

Unfortunately, copyright laws obstruct sharing data in this fashion. Until this issue is resolved (possibly a long wait), we propose instead the sharing of acoustic features calculated from the audio files. For example, in our recent evaluation we shared Mel-frequency cepstral coefficients (MFCCs) for each song. Starting from these common features, we were able to compare different algorithms on the same data, and we even saved some bandwidth transferring this data instead of the original waveforms. The best acoustic reconstruction possible from these reduced representations is only vaguely recognizable as the original music, so we are confident that sharing derived data of this kind will present no threat to copyright owners. Indeed, it is almost axiomatic that a good feature representation will eliminate much of the information present in the original signal, paring it down to leave only the essentials necessary for the task in question.¹

MFCC features are currently popular as a basis for music similarity techniques, but their use is by no means as ubiquitous as it is in speech recognition. It is likely that over time researchers will add additional features to their repertoires. Until it is possible for sites to share raw audio, then, we propose that authors share and distribute tools for the calculation of promising features. By downloading these tools and passing them over private collections, individual groups can generate features that can then be shared.
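As an illustration of the kind of shareable feature-calculation tool proposed here, the following sketch (ours, not the authors' distribution; it assumes the third-party librosa package and a hypothetical directory of locally-owned audio) computes per-frame MFCCs for each file and saves only the derived features, not the audio itself.

```python
# Sketch of a shareable feature-extraction tool: compute MFCCs per file
# and save only the derived features (not the copyrighted audio).
# Assumes the librosa and numpy packages; paths are hypothetical.
import glob
import os

import librosa
import numpy as np

def extract_mfccs(audio_path, n_mfcc=20, sr=22050):
    """Return an (n_frames, n_mfcc) array of MFCC frames for one file."""
    y, sr = librosa.load(audio_path, sr=sr, mono=True)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # shape (n_mfcc, n_frames)
    return mfcc.T

def build_feature_archive(audio_dir, out_file="features.npz"):
    """Extract MFCCs for every .wav file in audio_dir and save a shareable archive."""
    features = {}
    for path in glob.glob(os.path.join(audio_dir, "*.wav")):
        song_id = os.path.splitext(os.path.basename(path))[0]
        features[song_id] = extract_mfccs(path)
    np.savez_compressed(out_file, **features)

if __name__ == "__main__":
    build_feature_archive("my_local_collection")  # hypothetical directory name
```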

2.2 Data for Subjective Similarity
Subjective similarity can be determined using sources of human opinion mined from the Web. Here the required data is highly dependent on the technique used and the time at which the data was mined. We propose, then, that researchers using such techniques make their distilled datasets publicly available so that algorithms can be compared on the same data. We give examples of such datasets in the description of our database below.

3. EVALUATION DATABASE
Our database consists of audio and Web-mined data suitable for determining artist similarity. The dataset covers 400 artists chosen to have the maximal overlap of two of our main sources of Web-mined data: the artists best represented on the OpenNap peer-to-peer network in mid 2002, and the "Art of the Mix" playlist data from early 2003. We purchased audio and collected other data from the Web to cover these artists. We describe each of these sources in more detail below.

¹ Although it could be argued that subjective music similarity depends on practically all the information of interest to a listener, we confidently predict that it will be many years before an automatic system attempts to make use of anything like this richness.

3.1 Audio Features
The audio data consists of 8827 songs with an average of 22 songs per artist. As described above, we pooled data between our different labs in the form of MFCC features rather than the original waveforms, both to save bandwidth and to avoid copyright problems. This had the added advantage of ensuring both sites started with the same features when conducting experiments.

3.2 Survey Data
Human similarity judgments came from our previously-constructed similarity survey website [8], which explicitly asked human informants for judgments: we defined a set of some 400 popular artists, then presented subjects with a list of 10 artists $a_1, \ldots, a_{10}$ and a single target artist $a_t$, asking "Which of these artists is most similar to the target artist?" We interpret each response to mean that the chosen artist $a_c$ is more similar to the target artist $a_t$ than any of the other artists in the list that are known to the subject. For each subject, we infer which artists they know by seeing if the subject ever selects those artists in any context.

Ideally, the survey would provide enough data to derive a full similarity matrix, for example by counting how many times informants selected artist $a_i$ as being most similar to artist $a_j$. However, even with the 22,300 responses collected (from 1,000 subjects), the coverage of our modest artist set is relatively sparse.

3.3 Expert Opinion
Another source of data is expert opinion. Several music-related online services contain music taxonomies and articles containing similarity data. The All Music Guide (www.allmusic.com) is such a service, in which professional editors write brief descriptions of a large number of popular musical artists, often including a list of similar artists. We extracted the similar-artist lists from the All Music Guide for the same 400 artists in our set, discarding any artists from outside the set, resulting in an average of 5.4 similar artists per list.

3.4 Playlist Co-occurrence
Yet another source of human opinion about music similarity is human-authored playlists. We assume that such playlists contain similar music (certainly an oversimplification, but one that turned out to be quite successful in our evaluations).

Again, the Web is a rich source for such playlists. In particular, we gathered over 29,000 playlists from "The Art of the Mix", a website that serves as a repository and community center for playlist hobbyists (www.artofthemix.org). After filtering for our set of 400 artists, we were left with some 23,000 lists with an average of 4.4 entries.
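The playlist data can be turned into a co-occurrence-based similarity measure along the following lines; this is a toy sketch with invented artist names, not the authors' processing pipeline.

```python
# Sketch: build an artist-by-artist co-occurrence similarity measure
# from human-authored playlists (artist names here are invented examples).
from collections import defaultdict
from itertools import combinations

playlists = [
    ["artist_a", "artist_b", "artist_c"],
    ["artist_b", "artist_c"],
    ["artist_a", "artist_c", "artist_d"],
]

cooccur = defaultdict(int)
for playlist in playlists:
    for x, y in combinations(sorted(set(playlist)), 2):
        cooccur[(x, y)] += 1  # one count per unordered artist pair per playlist

def similarity(a, b):
    """Co-occurrence count as a crude similarity score between two artists."""
    if a == b:
        return float("inf")  # self-similarity is not meaningful here
    return cooccur[tuple(sorted((a, b)))]

print(similarity("artist_a", "artist_c"))  # -> 2 (they share two playlists)
```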

3.5 OpenNap User Collections
Similar to user-authored playlists, individual music collections are another source of music similarity often available on the Web. Mirroring the ideas that underlie collaborative filtering, we assume that artists co-occurring in someone's collection have a better-than-average chance of being similar, a chance which increases with the number of co-occurrences observed.

We retrieved user collection data from OpenNap, a popular music sharing service, although we did not download any audio files. After discarding artists not in our data set, we were left with about 175,000 user-to-artist relations from about 3,200 user collections.

Source       # obs    art/obs   ≥1 obs   ≥10 obs   med # art
Survey       17,104   5.54      7.49%    0.36%     23
Expert       400      5.41      1.35%    -         5
Playlist     23,111   4.38      51.4%    11.4%     213
Collection   3,245    54.3      94.1%    72.1%     388

Table 1: Sparsity of subjective measures. For each subjective data source we show the number of 'observations', the average number of valid artists in each observation, the proportion of the 79,800 artist pairs for which at least 1 co-occurrence or direct judgment was available, the proportion with 10 or more observations, and the median count of comparison artists (out of 400).

3.6 Sparsity
A major difference between audio-based and subjective similarity measures lies in the area of data coverage: automatic measures based directly on the waveform can be applied to any pair of examples, even over quadratically-sized sets, given sufficient computation time. Subjective ratings, however, inevitably provide sparse coverage, where only some subset of pairs of examples are directly compared. In the passive mining of subjective opinions provided by expert opinion and playlist and collection co-occurrence, there will be many artists who are never observed together, giving a similarity of zero. In the survey, we were able to choose which artists were presented for comparison, but even then we biased our collection in favor of choices that were believed to be more similar based on prior information. Specific sparsity proportions for the different subjective data sources are given in Table 1, which shows the proportion of all 79,800 possible artist pairs with nonzero comparisons/co-occurrences, the proportion with 10 or more observations (meaning estimates are relatively reliable), and the median number of artists for which some comparison information was available (out of 400). (For more details, see http://www.ee.columbia.edu/~dpwe/research/musicsim/.)

Two factors contribute to limit co-occurrence observations for certain artists. The first is that their subjective similarity may be very low. Although having zero observations means we cannot distinguish between several alternatives that are all highly dissimilar to a given target, this is not a particularly serious limitation, since making precise estimates of low similarity is not important in our applications. The second contributory factor, however, is unfamiliarity among the informant base: if very few playlists contain music by a certain (obscure) band, then we have almost no information about which other bands are similar. It is not that the obscure band is (necessarily) very different from most bands, but the 'threshold of dissimilarity' below which we can no longer distinguish comparison artists is much lower in these cases. The extreme case is the unknown band for which no subjective information is available – precisely the situation motivating our use of acoustic similarity measures.
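Given any such source expressed as a matrix of pairwise observation counts, sparsity figures of the kind reported in Table 1 can be computed roughly as follows (a sketch under the assumption of a dense, symmetric count matrix; the function name and thresholds are ours).

```python
# Sketch: sparsity statistics for a subjective similarity source,
# given a symmetric matrix of co-occurrence / judgment counts.
import numpy as np

def sparsity_stats(counts):
    """counts: (n_artists, n_artists) symmetric array of observation counts."""
    n = counts.shape[0]
    iu = np.triu_indices(n, k=1)            # the n*(n-1)/2 distinct artist pairs
    pairs = counts[iu]
    frac_at_least_1 = np.mean(pairs >= 1)   # proportion of pairs with any observation
    frac_at_least_10 = np.mean(pairs >= 10) # proportion with relatively reliable estimates
    # median number of comparison artists with any information, per artist
    med_artists = np.median(np.sum(counts >= 1, axis=1))
    return frac_at_least_1, frac_at_least_10, med_artists

# Demo with a random sparse-ish count matrix for 400 artists (79,800 pairs):
rng = np.random.default_rng(0)
demo = np.triu(rng.poisson(0.2, size=(400, 400)), 1)
demo = demo + demo.T                        # symmetric, zero diagonal
print(sparsity_stats(demo))
```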

4. EVALUATION METHODS
In this section, we describe our evaluation methodologies. The first technique is specific to the survey data, which presents data in triplets and has sparse coverage. The second approach is a general way to compare two similarity matrices whose $(i, j)$-th element gives the similarity between artist $i$ and artist $j$ according to some measure. This technique is useful to gauge agreement between measures.

The choice of ground truth affects which technique is more appropriate. On the one hand, the survey explicitly asked subjects for similarity ratings and as such it might be regarded as a good source of ground truth. On the other hand, we expect many of the techniques based on the Web-mined data to be good sources of ground truth since they are derived from human choices.

4.1 Evaluating against survey data
The similarity data collected using our Web-based survey can be argued to be a good independent measure of ground truth artist similarity, since subjects were explicitly asked to indicate similarity. We can compare the survey informant judgments directly to the similarity metric that we wish to evaluate. That is, we ask the similarity metric the same questions that we asked the subjects and compute an average agreement score.

We used two variants of this idea. The first, "average response rank", takes each list of artists presented to the informant and ranks it according to the similarity metric being tested. We then find the rank in this list of the choice picked by the informant (the 'right' answer), normalized to a range of 1 to 10 for lists that do not contain 10 items. The average of this ranking across all survey ground-truth judgment trials is the average response rank. For example, if the experimental metric agrees perfectly with the human subject, then the ranking of the chosen artist will be 1 in every case, while a random ordering of the artists would produce an average response rank of 5.5. In practice, the ideal score of 1.0 is not possible because informants do not always agree about artist similarity; therefore, a ceiling exists corresponding to the single, consistent metric that best matches the survey data. For our data, this was estimated to be 1.98.
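A minimal sketch of the average response rank computation, assuming each survey trial is stored as a (target, presented_list, chosen) tuple and that sim(a, b) is whatever candidate similarity metric is being evaluated; both names are hypothetical, and the linear rescaling used to map shorter lists onto the 1-10 range is our guess at the normalization described above.

```python
# Sketch: average response rank for a candidate similarity metric `sim`,
# evaluated on survey trials of the form (target, presented_artists, chosen_artist).
def average_response_rank(trials, sim):
    ranks = []
    for target, presented, chosen in trials:
        # rank the presented artists by decreasing similarity to the target
        ordered = sorted(presented, key=lambda a: sim(target, a), reverse=True)
        rank = ordered.index(chosen) + 1              # 1-based rank of the human choice
        if len(presented) > 1:                        # rescale short lists onto 1..10
            rank = 1 + 9 * (rank - 1) / (len(presented) - 1)
        ranks.append(rank)
    return sum(ranks) / len(ranks)

# A perfect metric gives 1.0; a random ordering gives about 5.5 for 10-item lists.
```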

A different way of using the survey data is to view each judgment as several 3-way sub-judgments that the chosen artist $a_c$ is more similar to the target $a_t$ than each unchosen artist $a_u$ in the list; that is, $S(a_c, a_t) > S(a_u, a_t)$, where $S(\cdot, \cdot)$ is the similarity metric. The "triplet agreement score" is computed by counting the fraction of such ordered "triplets" for which the experimental metric gives the same ordering.
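The triplet agreement score can be computed from the same trial records; the trial format and sim function are assumptions carried over from the previous sketch.

```python
# Sketch: triplet agreement score. Each survey response expands into
# 3-way judgments "chosen is more similar to target than unchosen".
def triplet_agreement(trials, sim):
    agree = total = 0
    for target, presented, chosen in trials:
        for other in presented:
            if other == chosen:
                continue
            total += 1
            if sim(chosen, target) > sim(other, target):
                agree += 1
    return agree / total if total else 0.0
```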

4.2 Evaluation against similarity matrices
Although the survey data is a useful and independent evaluation set, it is in theory possible to regard any of our subjective data sources as ground truth, and to seek to evaluate against them. Given a reference similarity matrix derived from any of these sources, we can use an approach inspired by the text information retrieval community [6] to score other similarity matrices. Here, each matrix row is sorted by decreasing similarity and treated as the result of a query for the corresponding target artist. The top $N$ 'hits' from the reference matrix define the ground truth (where $N$ is chosen to avoid the 'sparsity threshold' mentioned above) and are assigned exponentially-decaying weights, so that the top hit has weight 1, the second hit has weight $\alpha_r$, the next $\alpha_r^2$ and so on, where $\alpha_r < 1$. The candidate similarity matrix 'query' is scored by summing the weights of the hits scaled by another exponentially-decaying factor, so that a ground-truth hit placed at rank $k$ in the candidate ordering is scaled by $\alpha_c^{k-1}$. Thus this "top-N ranking agreement score" $s_i$ for row $i$ is

$s_i = \sum_{r=1}^{N} \alpha_r^{\,r-1} \, \alpha_c^{\,k_r - 1}$

where $k_r$ is the ranking according to the candidate measure of the $r$-th ranked hit under the ground truth. $\alpha_r$ and $\alpha_c$ govern how sensitive the metric is to ordering under the candidate and reference measures respectively. With $N = 10$ and the $\alpha_r$ and $\alpha_c$ values we used (biased to emphasize when the top few ground-truth hits appear somewhere near the top of the candidate response), the best possible score of 2.0 is achieved when the top 10 ground truth hits are returned in the same order by the candidate matrix. Finally, the overall score for the experimental similarity measure is the average of the normalized row scores,

$S = \frac{1}{M} \sum_{i=1}^{M} \frac{s_i}{s_{\max}}$,

where $s_{\max}$ is the best possible row score and $M$ is the number of rows. Thus a larger ranking agreement score is better, with 1.0 indicating perfect agreement.

#mix    MFCC          Anchor
8       4.28 / 63%    4.25 / 64%
16      4.20 / 64%    4.19 / 64%
32      4.15 / 65%    -

Table 2: Survey evaluation metrics (average response rank / triplet agreement percentage) for K-means models of MFCC features ('MFCC') and GMM models of Anchor Space features ('Anchor'). #mix gives the number of K-means clusters or mixture components.
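A sketch of the top-N ranking agreement score as reconstructed above. The decay constants are placeholders chosen so that their product is 0.5, which makes the ideal row score for N = 10 come out near the 2.0 quoted in the text; the authors' exact values are not recoverable from this copy, and for brevity the sketch does not exclude the self-similarity diagonal.

```python
# Sketch: top-N ranking agreement between a candidate and a reference
# similarity matrix (rows = query artists). alpha_r and alpha_c are
# placeholder decay constants; ranks are 0-based here, matching the
# alpha^(r-1) / alpha^(k-1) weighting in the text.
import numpy as np

def row_score(ref_row, cand_row, n=10, alpha_r=2 ** -0.5, alpha_c=2 ** -0.5):
    ref_order = np.argsort(-ref_row)                                  # reference ranking, best first
    cand_rank = {a: k for k, a in enumerate(np.argsort(-cand_row))}   # candidate rank of each artist
    return sum((alpha_r ** r) * (alpha_c ** cand_rank[a])
               for r, a in enumerate(ref_order[:n]))                  # r = 0 is the top reference hit

def ranking_agreement(reference, candidate, n=10, alpha_r=2 ** -0.5, alpha_c=2 ** -0.5):
    best = sum((alpha_r * alpha_c) ** r for r in range(n))            # perfect-ordering row score (~2.0)
    scores = [row_score(reference[i], candidate[i], n, alpha_r, alpha_c) / best
              for i in range(reference.shape[0])]
    return float(np.mean(scores))                                     # 1.0 = perfect agreement

# Demo: identical matrices score 1.0; an unrelated random candidate scores near 0.
rng = np.random.default_rng(0)
ref = rng.random((400, 400))
print(ranking_agreement(ref, ref))
print(ranking_agreement(ref, rng.random((400, 400))))
```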

5. EXPERIMENTAL RESULTS
We have previously used our database and methodology to compare a variety of similarity measures [3]. These approaches succeeded in making possible comparisons between different parameter settings, models and techniques.

For example, Table 2 reproduces results from [3] comparing two acoustic-based similarity measures, using either a K-means cluster of MFCC features to model each artist's repertoire, compared via Earth-Mover's Distance [11], or a suite of pattern classifiers to map MFCCs into an "anchor space", in which probability models are fit and compared [2].

Table 2 shows the average response rank and triplet agreement score using the survey data as ground truth, as described in Section 4.1. We see that both approaches have similar performance under these metrics, despite the prior information encoded in the anchors. It would have been very difficult to make such a close comparison without running experiments on a common database.

The scale of our experiment gives us confidence that we are seeing real effects. Access to a well-defined ground truth (in this case the survey data) enabled us to avoid performing user tests, which would likely have been impractical for a database of this size.

Using the techniques of Section 4.2 we were also able to make pairwise comparisons between all our subjective data measures, and to compare the two acoustic models against each subjective measure as a candidate ground truth. The rows in Table 3 represent similarity measures being evaluated, and the columns give results treating each of our subjective similarity metrics as ground truth. Scores are computed as described in Section 4.2. For this scoring method, a random matrix scores 0.03 and the ceiling, representing perfect agreement with the reference, is 1.0.

            survey   expert   playlist   collctn   mean*
survey        -       0.40     0.11       0.10     0.20
expert       0.27      -       0.09       0.07     0.14
playlist     0.19     0.23      -         0.58     0.33
collctn      0.14     0.16     0.59        -       0.30
Anchor       0.11     0.16     0.05       0.03     0.09
MFCC         0.13     0.16     0.06       0.04     0.10
mean*        0.17     0.21     0.16       0.15

Table 3: Top-N ranking agreement scores for acoustic and subjective similarity measures with respect to each subjective measure as ground truth. "mean*" is the mean of the row or column, excluding the self-reference ("cheating") diagonal. A random ordering scores 0.03.

Note the very high agreement between the playlist- and collection-based metrics: one is based on user-authored playlists, and the other on complete user collections, so it is unsurprising that the two agree. The moderate agreement between the survey and expert measures is also understandable, since in both cases humans are explicitly judging artist similarity. Finally, note that the performance of the acoustic measures is quite respectable, particularly when compared to the expert metric.

The means down each row and column, excluding the self-reference diagonal, are also shown. We consider the row means to be an overall summary of the experimental metrics, and the column means to be a measure of how well each measure serves as ground truth by agreeing with all the data. By this standard, the expert measure (derived from the All Music Guide) forms the best reference or ground truth.

6. CONCLUSIONS AND FUTURE PLANS
We have described our recent work developing a database, methodology and ground truth for the evaluation of automatic techniques for music similarity. Our database covers 400 popular artists and contains acoustic and subjective data. Our evaluation methodologies can use as ground truth any data source that can be expressed as a (sparse) similarity matrix. However, we also propose a way of determining the 'best' collective ground truth as the experimental measure which agrees most often with the other sources.

We believe our work represents not only one of the largest evaluations of its kind but also one of the first cross-group music similarity evaluations in which several research groups have evaluated their systems on the same data. Although this approach is common in other fields, it is rare in our community. Our hope is that we inspire other groups to use the same approach and also to create and contribute their own equivalent databases.

As such, we are open to adding new acoustic features and other data to our database. At present, we have fixed the artist set, but if other sites can provide features and other data for additional artists these could be included. We would also welcome new feature calculation tools and scoring methodologies.

In order for this to take place, we are in the process of setting up a Website, www.musicseer.org, from which users can download our database, feature calculation tools and scoring scripts. Other groups will be encouraged to submit their own data or features and scripts. We foresee no copyright problems, given that we are merely exchanging acoustic features that cannot be inverted into illegal copies of the original music. We hope that this will form the basis of a collective database which will greatly facilitate the development of music similarity algorithms.

7. ACKNOWLEDGMENTS
Special thanks to Brian Whitman for the original OpenNap dataset, for help gathering the playlist data, and for generally helpful discussions.

8. REFERENCES
[1] J.-J. Aucouturier and F. Pachet. Music similarity measures: What's the use? Proc. Int. Symposium on Music Info. Retrieval (ISMIR), 2002.

[2] A. Berenzweig, D. P. W. Ellis, and S. Lawrence. Anchor space for classification and similarity measurement of music. ICME 2003, 2003.

[3] A. Berenzweig, B. Logan, D. P. W. Ellis, and B. Whitman. A large-scale evaluation of acoustic and subjective music similarity measures. Submitted to Int. Symposium on Music Inform. Retrieval (ISMIR), 2003. http://www.ee.columbia.edu/~dpwe/pubs/ismir03-sim-draft.pdf

[4] S. Blackburn and D. D. Roure. A tool for content based navigation of music. Proc. ACM Conf. on Multimedia, 1998.

[5] T. L. Blum, D. F. Keislar, J. A. Wheaton, and E. H. Wold. Method and article of manufacture for content-based analysis, storage, retrieval, and segmentation of audio information. U.S. Patent 5,918,223, 1999.

[6] J. S. Breese, D. Heckerman, and C. Kadie. Empirical analysis of predictive algorithms for collaborative filtering. Fourteenth Annual Conference on Uncertainty in Artificial Intelligence, pp. 43–52, 1998. http://citeseer.nj.nec.com/breese98empirical.html

[7] W. W. Cohen and W. Fan. Web-collaborative filtering: recommending music by crawling the web. WWW9 / Computer Networks 33(1-6):685–698, 2000. http://citeseer.nj.nec.com/cohen00webcollaborative.html

[8] D. P. Ellis, B. Whitman, A. Berenzweig, and S. Lawrence. The quest for ground truth in musical artist similarity. Proc. Int. Symposium on Music Info. Retrieval (ISMIR), 2002.

[9] J. T. Foote. Content-based retrieval of music and audio. SPIE, pp. 138–147, 1997.

[10] A. Ghias, J. Logan, D. Chamberlin, and B. Smith. Query by humming. ACM Multimedia, 1995.

[11] B. Logan and A. Salomon. A music similarity function based on signal analysis. ICME 2001, 2001.

[12] R. McNab, L. Smith, I. Witten, C. Henderson, and S. Cunningham. Towards the digital music library: Tune retrieval from acoustic input. Digital Libraries 1996, pp. 11–18, 1996.

[13] G. Tzanetakis. Manipulation, Analysis, and Retrieval Systems for Audio Signals. Ph.D. thesis, Princeton University, 2002.

[14] B. Whitman and S. Lawrence. Inferring descriptions and similarity for music from community metadata. Proc. Int. Comp. Music Conf., 2002.


Open Position: Multilingual Orchestra Conductor. Lifetime opportunity.

Eloi Batlle
Audiovisual Institute, Universitat Pompeu Fabra, Barcelona, Spain
[email protected]

Enric Guaus
Audiovisual Institute, Universitat Pompeu Fabra, Barcelona, Spain
[email protected]

Jaume Masip
Audiovisual Institute, Universitat Pompeu Fabra, Barcelona, Spain
[email protected]

ABSTRACT
In this paper we show the need to see the Music Information Retrieval world from different points of view in order to make any progress. To help the interaction between different languages (those of engineers, musicians, psychologists, etc.), we present a tool that tries to link all those backgrounds.

1. INTRODUCTION
When we started research in Music Information Retrieval some years ago, we began discussing different ideas, projects and applications dealing with MIR. It was really a funny discussion because, in fact, it became a brainstorming meeting. But after an hour and a half, we realized that different members of our group were talking in different languages. While some of us were talking about research results using Hidden Markov Models and zero crossings of one specific input waveform, others were talking about the expressiveness and vivacity of that specific audio. Thus, the discussion turned in a new direction: should the Music Information Retrieval community build a dictionary to translate the information in both directions? We concluded the discussion unanimously: yes, it should! In this paper, we present a simple tool that could help in the construction of this dictionary, although it is not the goal of this paper to explain it exhaustively. We just present it, as another contribution to the existing tools, in order to better define their features and requirements.

2. HUMAN ANALYSIS
A song or a musical piece can be analyzed by humans from many different points of view. Melodic analysis is the most intuitive one. The melody of a song can be easily defined, according to Leonard Bernstein, as the part of the music that can be whistled. In most cases, the melody is played by a singing voice, although it can be mixed with many different instruments. When different voices are singing together, the melody is often associated with the voice with the higher pitch. If the musical piece has no singing voice, the melody is usually played by a specific instrument but, if there is no dominant instrument, the melody can be found wherever the listener wants. Percussive music has no melodies.

We can see that it is not easy to define what the melody is. Different musical genres can interpret the melody concept in many different senses, and different socio-cultural aspects can strongly affect our own perception of the melody, too. Any attempt to define what the melody is seems to be a very difficult task for humans. Thus, nobody can expect this difficult task to be accomplished by computers. The Bernstein definition, the best one from our point of view, becomes completely useless for currently known programming techniques.

Music can also be analyzed from a rhythmical point of view. According to simple definitions, one could think that rhythm is whatever you can follow just by tapping your leg with your hand. Unfortunately, this definition is not valid here, because it covers only a narrow subset of the whole meaning of the word. Rhythm can be analyzed at three different levels [3]. The first level is the macro-level rhythm analysis. This kind of analysis studies the structure of the piece, that is, the chorus and solos in a song or the different acts in an opera. With the mid-level rhythm analysis one can distinguish between different phrases in a song and observe how they answer one another. Finally, at the micro-level rhythm analysis, the note durations and rhythmic bases are studied.

There are lots of drum-loop collections. Disk jockeys will play, mix and modify them in order to create different rhythmic patterns in their performances. Also, classical music composers know that symphonies are divided into three sections or movements. Both are working with rhythm, but at different cognitive levels. When computers manage rhythmic information, they must take all three levels into account.

Harmony is also quite important in a musical analysis. While melody involves the evolution in time of one specific part (instrument) of the music, harmony involves all the instruments and notes played together at a given time lapse, that is, the chords. The evolution of these chords is also studied, as well as voicing and sub-melodies created by the time evolution of the different notes in the chords. Harmonic analysis work is very important for composers and performers, but not so important for average listeners: nobody remembers a song for a specific IIm7 - V7 - Imaj7 succession! We can therefore consider that harmonic analysis is not relevant for a lot of applications of Music Information Retrieval.

Timbre analysis has become more and more important in the last twenty years. Music has evolved considerably over the last centuries, but only recently has timbre grown into a major subject of study. Different genres have been created, but music has been played with almost the same instruments. In recent years, with the fast growth of analog electronic (1960s and 70s) and digital (1980s and 90s) technologies, a lot of new instruments have been created. These new instruments create new timbres. Furthermore, sometimes the timbre can exactly identify a specific musical piece or composer. Some of these new instruments are physical instruments while the others are "virtual". Nowadays, timbre characteristics are really important in the aesthetic aspects of new music [4]. With virtual instruments one can create texture-based music: music without melody and without rhythm, just playing with timbres.

3. AGAIN, DOES IT MAKE SENSE TO USE A COMPUTER?

This discussion is not new, and other areas have also raised similar questions: should we use computers to solve subjective problems? Or even more: can we do it? AI researchers work very hard to make the answer yes, but unfortunately there are some problems for which it is not at all clear which path we should follow (if there is one).

In this section we introduce a method that is able to show some interesting results when dealing with music similarity and retrieval. We may say that these results are better when the objectivity of the query is high, but with this statement a new question arises: if the answer is subjective, how can we say whether it is right or wrong? The computer should therefore see life as a gradient of gray, not only black and white. This method is described in more detail in [2]. The idea behind it lies in the fact that music (and all audio in general) can be seen as a sequence of acoustic events. Since much of music's meaning lies in its temporality, the similarity system will exploit this fact to process the audio.

Let's imagine the following situation: we have a song played by a guitar, then a violin, then a piano and finally again a guitar. We can describe this piece of music (up to some level of abstraction) using its players, that is, guitar → violin → piano → guitar. Now we are given another piece of music and we are asked to find its similarity with the former song. The question to answer is: can this second music piece be performed with the sequence of players guitar → violin → piano → guitar? Or, what is the same, as the conductor of an orchestra, can I reproduce the given music if I conduct the players in the order guitar → violin → piano → guitar? Will it sound more or less the same? If the answer is yes, we have got it: they are similar. Most MIR approaches work the other way around. The classical approach is to have a list of descriptions of all the music in the database, then extract a description of the unknown music and find the closest descriptions from the original database. Our approach never extracts a description of the unknown audio. We simply try to play the new piece with the players of the known songs.

Figure 1: Data flow within the system process.

Of course, in the real world, music cannot be described with simple instruments, and songs are usually made of complex sounds. Therefore, the "players" we will use to describe our music are going to be abstract and have no physical meaning. So now the name of the game is to find these abstract players from the music. We can do this using Hidden Markov Models (HMMs) [1], based on their property as a doubly embedded stochastic process: one process that can be seen (the music itself) and one that is hidden (the orchestra). We will use this system, based on the source generation of the sound rather than its description with a sequence of features and parameters like MFCCs, spectral flatness, etc., to see and discuss the advantages, disadvantages, uses and limitations of automatic music information retrieval systems. (Granted, writing down the sequence of generators is like writing a sequence of features, but we have to look at the inner philosophy of this sentence.) Since the generators of the audio are also descriptors, we will use the two words interchangeably. One of the main features of these HMMs is that they can describe the generation of the audio.
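As an indication of how such abstract 'players' might be obtained in practice, the sketch below (our illustration using the third-party hmmlearn package, not the authors' system) fits a Gaussian HMM to feature frames and then uses Viterbi decoding to describe any track as a sequence of hidden-state 'generators'.

```python
# Sketch: learn abstract "players" as HMM hidden states over feature frames,
# then describe any track as its Viterbi sequence of those states.
# Uses the third-party hmmlearn package; an illustration, not the authors' code.
import numpy as np
from hmmlearn import hmm

def fit_players(frames, n_players=16, seed=0):
    """frames: (n_frames, n_features) array pooled over a training corpus."""
    model = hmm.GaussianHMM(n_components=n_players, covariance_type="diag",
                            n_iter=20, random_state=seed)
    model.fit(frames)
    return model

def describe(model, frames):
    """Return the Viterbi state sequence (the 'conducting order' of the players)."""
    return model.predict(frames)

# Demo with random stand-in features:
rng = np.random.default_rng(0)
model = fit_players(rng.normal(size=(2000, 12)))
print(describe(model, rng.normal(size=(300, 12)))[:20])
```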

While analyzing fragments of music, we can refer to self-similarity and cross-similarity. We talk about self-similarity when the analysis is of the audio against the audio itself. This will be very useful when trying to find musical structures, choruses, repetitions, etc. On the other hand, we have cross-similarity when the analysis is performed against other pieces, which is useful for finding clusters with the same musical style, music browsing, etc.

3.1 System structure
The system inputs are the main audio track, the to-be-compared audio track and the audio descriptors. The system output is the similarity structure between both audio inputs. Both audio tracks are first transformed into feature vector parameters and then observed by the audio descriptors. The main track observations are used to generate a fingerprint sequence of audio descriptors through a Viterbi algorithm, as already described in the papers cited before, and then a similarity matrix is built by matching the compared audio descriptor observations against the main audio track's sequence of descriptors. Finally, the correspondence block determines which audio segments could have been generated by the "same" sequence of observers, and the classifier filters out the sequences and finds similarity structures. Figure 1 shows the data flow within the system process.

3.2 Similarity matrix

87

Page 10: MIR/MDL Evaluation: Making Progress · recommendations of the “MIR/MDL Evaluation Frameworks Project” is to be published as: Downie, J. Stephen. 2003. Toward the scientific evaluation

Figure 2: Similarity matrix.

Figure 3: Extraction of similar segments.

The similarity matrix is built as sketched in Figure 2. The first column shows the sequence of feature vectors extracted from the main audio track, converted into a sequence of audio descriptors in the second column. The compared audio track feature vectors are represented as a row at the bottom of the figure. Each column of the graph shows the distance metric of three feature vectors against all the descriptor observers. In the example of Figure 2, the main and the compared audio tracks are the same, which yields a diagonal of maximum scores between the column graphs. This metric is used for identification purposes, which is out of the scope of this paper. Moreover, this example assumes that the audio track is composed of a theme repeated twice. As shown in the same example, the repetition produces two secondary diagonals: the top-left diagonal indicates that the second section is similar to the first, while the bottom-right indicates that the first section is similar to the second. We can conclude, therefore, that similarity correspondences between two audio segments can be inferred from the matrix as continuous high-score diagonal lines.
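The following toy sketch (ours, not the authors' implementation) builds such a matrix for two tracks that have already been reduced to discrete descriptor sequences; repeated or shared material shows up as high-scoring diagonals, as in Figure 2.

```python
# Toy sketch: similarity matrix between two tracks described as sequences
# of discrete audio descriptors (e.g. HMM states). Cell (i, j) scores how
# well frame i of the compared track matches frame j of the main track;
# similar segments appear as runs of high scores along diagonals.
import numpy as np

def similarity_matrix(main_desc, cmp_desc):
    main = np.asarray(main_desc)
    cmp_ = np.asarray(cmp_desc)
    return (cmp_[:, None] == main[None, :]).astype(float)

main_track = [0, 0, 1, 2, 2, 3, 0, 0, 1, 2, 2, 3]   # a theme repeated twice
compared   = [0, 0, 1, 2, 2, 3]
S = similarity_matrix(main_track, compared)
print(S.shape)   # (6, 12): two high-scoring diagonals, one per repetition
```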

3.3 Correspondences
The correspondence extraction block is in charge of extracting similar segment pairs from the similarity matrix. The algorithm combines a Hough transform [7] in the (+1,+1) direction vector with maximum detection in the (-1,+1) direction vector. Diagonal lines of Hough maxima are first characterized by their starting point and line length. Then, simple heuristics are applied to interpolate lines with discontinuities smaller than a certain maximum threshold. The output of the correspondence block is a sequence of similar segment pairs giving the starting point in the main audio, the starting point in the compared audio and the length of both segments. Figure 3 shows an example of extracting a similar segment pair from the similarity matrix.
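The Hough-based extraction itself is not reproduced here. The sketch below only illustrates the underlying idea of reading similar-segment pairs off high-scoring diagonal runs, with a small-gap interpolation threshold analogous to the heuristics described above; all threshold names and values are hypothetical.

```python
import numpy as np

def diagonal_correspondences(sim, score_thresh, min_len=8, max_gap=2):
    """Return (start_main, start_compared, length) triples for diagonal runs
    of above-threshold scores, bridging gaps of up to max_gap frames."""
    Tm, Tc = sim.shape
    segments = []
    for d in range(-(Tm - 1), Tc):                     # every diagonal offset
        diag = np.diagonal(sim, offset=d)
        padded = list(diag) + [-np.inf] * (max_gap + 1)  # sentinel closes runs
        start, gap = None, 0
        for k, v in enumerate(padded):
            if v >= score_thresh:
                start = k if start is None else start
                gap = 0
            elif start is not None:
                gap += 1
                if gap > max_gap:                       # gap too long: close run
                    end = k - gap                       # last above-threshold index
                    if end - start + 1 >= min_len:
                        i0 = start + max(0, -d)         # row of run start
                        j0 = start + max(0, d)          # column of run start
                        segments.append((i0, j0, end - start + 1))
                    start, gap = None, 0
    return segments
```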

Figure 4: Classification of similar segments.

3.4 Classification
The last block, shown in Figure 4, classifies similar segment pairs and builds the overall similarity structure between both audio tracks. All similar segment pairs are matched against each other by applying accuracy thresholds at the segment margins. This matching algorithm generates sets of similar themes at different levels of granularity. Finally, the similarity sets are matched among themselves to generate hierarchical structures of increasing granularity.

4. OPEN DISCUSSION
It is really pleasant to realize that the Music Information Retrieval community has grown thanks to people coming from many different areas of knowledge. For instance, musicians and engineers can talk about music, about their sensations when listening to Mozart or U2, about their intentions when creating different textures, and so on. But we should not be misled by this fact. Although all of them talk about music, and their contributions always come in handy, they often speak different languages. We know this is a well-known problem, and it is not our intention to discuss it here [5].

But from the computer science point of view, we think that we should focus our efforts along two different paths. The first one is related to the so-called objective parameters of music, that is, structure, rhythm or timbre description. It is fairly easy to extract many parameters from music. Some techniques need a lot of improvement, but in the next few years we expect those techniques to be almost perfect. Therefore, let us imagine that we have algorithms that are able to extract as many features of any audio signal as we want (or need). What do we do with all this data? We can use it to classify genres, to find similarities between clarinet solos, and so on. Does the computer science community make the important decisions needed to distinguish between genres? Sometimes it does, but it often lacks the background. Musicologists should be included in the research groups.

On the other hand, we should also focus our studies on the perceptual aspects of music [6]. Why is the tuba sound generally associated with feelings of weight? The main problem in studying this is that we do not know exactly how the human brain works. If we do not know this, how can we expect a computer to perform the task? It is impossible. Psychologists should be included in the research groups.


Figure 5: General Overview of Music Information Retrieval. (The figure relates the contributing disciplines to MIR: musicians cover musical aspects such as melody and rhythm and social aspects such as genres; engineers and computer scientists cover descriptors such as MFCC and spectral centroid and transmission issues such as MPEG and perceptual compression; psychologists cover brain behaviour, feelings and cognition. AMADEUS is labelled as "just a tool", and some connections are marked as unrelated but useful.)

The system described in Figure 4 should be a small contribution to that meeting point, which sometimes seems to be nonexistent. We can extract different timbral, rhythmical and (still not available) melodic structures. Note that we are dealing with musical concepts, not low-level engineering concepts such as MFCC, spectral centroid, etc. These latter concepts are used, but never shown. Musical and psychological knowledge is the main architect of this transformation.

As mentioned above, we have to think in a gray scale. Psychologists and musicians have to transform this picture into a color landscape.

5. REFERENCES

[1] Batlle, E., Cano, P. "Automatic Segmentation for Music Classification using Competitive Hidden Markov Models", Proceedings of the International Symposium on Music Information Retrieval, Plymouth, USA, 2000.

[2] Batlle, E., Masip, J., Guaus, E. "Automatic Song Identification in Noisy Broadcast Audio", Proceedings of the International Conference on Signal and Image Processing, Kauai, USA, 2002.

[3] Handel, S. Listening: An Introduction to the Perception of Auditory Events. The MIT Press, 2nd edition, 1991.

[4] Blanquez, J., Morera, O. Loops: una historia de la musica electronica. Prologue by Simon Reynolds. Ed. Mondadori, Barcelona, 2002.

[5] Overview on the Sound Modeling Panel. Workshop on Current Research Directions in Computer Music, Barcelona, 15-17 November 2001.

[6] Vinet, H., Herrera, P., Pachet, F. "The CUIDADO Project", Proceedings of ISMIR 2002 - 3rd International Conference on Music Information Retrieval, Paris, France, 2002.

[7] Illingworth, J., Kittler, J. "A Survey of the Hough Transform", Computer Vision, Graphics, and Image Processing, vol. 44, pp. 87-116, 1988.


Emphasizing the Need for TREC-like Collaboration Towards MIR Evaluation

Shyamala Doraisamy Department of Computing

180 Queen’s Gate London SW7 2BZ

+44-(0)20-7594-8180

[email protected]

Stefan M Rüger Department of Computing

180 Queen’s Gate London SW7 2BZ

+44-(0)20-7594-8355

[email protected]

ABSTRACT
The need for standardized large-scale evaluation of music information retrieval (MIR) and music digital library (MDL) methodologies is being addressed with the recent resolution calling for the construction of the infrastructure necessary to support MIR/MDL research. The methodology of our MIR study investigating the use of n-grams for polyphonic music retrieval has been based on a small-scale test collection of around 10,000 polyphonic MIDI files developed by us. A review of MIR studies shows that a number of researchers have similarly developed individual state-of-the-art test collections for the purpose of metric scientific evaluation. However, there are a number of potential problems that are generic to these various small-scale test collections. These include the lack of consistency and completeness of the relevance judgements used. This paper discusses a number of test collections developed by various researchers and ourselves, and describes how many of the limitations generic to all these studies could be overcome through the development of standardized large-scale test collections.

1. INTRODUCTION
With the advancement of multimedia and network technologies, interest in music information retrieval (MIR) and music digital libraries (MDL) has grown noticeably over the past few years. There already exists an appreciable number of MIR systems that are commercially viable and of a high degree of sophistication. However, with the lack of standardized, generally agreed-upon test collections, tasks and metrics for MIR evaluation, researchers and developers are facing difficulties in benchmarking and endorsing the performance of their systems. The need for standardized large-scale evaluation of MIR methodologies has been identified only very recently with the resolution for the construction of the infrastructure necessary to support MIR/MDL research [1].

One of the approaches proposed towards MIR evaluation is the use of test collections based on the Cranfield and TREC models [1, 14] that have been used for many years in the text retrieval community. A test collection for information retrieval (IR) encompasses (i) a set of documents, (ii) a set of queries, and (iii) a set of relevance judgments, and aims to model real-world situations, so that one could expect the performance of a retrieval system on the test collection to be a good approximation of its performance in practice. Of the two series of studies conducted at Cranfield University, UK, it is the second series, conducted in the 1960s, that became the exemplar for experimental evaluation of IR systems (with the first conducted in the 1950s) [9,10]. The Cranfield 2 test collection consists of 1,400 documents, mainly in the field of aerodynamics, with 221 search questions [26]. Because of computer storage and processing costs, it was not until the 1980s, however, that large-scale testing became possible. In the early 90s the Cranfield 2 paradigm was gradually replaced by TREC (Text REtrieval Conference), a major initiative in IR evaluation with an emphasis on large collection size and completeness of relevance judgements. It made possible the large-scale, robust evaluation of text retrieval methodologies and has been running successfully since [5,9].

An important factor in the success of an evaluation is the task description and the metrics used to score the quality of a response. “Intuitively understandable metrics that map to commercially significant problems” have been described as one desirable feature of an evaluation [5, 22]. Query by melody (QBM), a task useful to most classes of MIR users (music librarians, disc-jockeys, music scholars, etc.), has been the research focus of a large number of MIR studies. There are various interfaces to these: query by humming (QBH), text-boxes for contour or absolute note letter-names input, or a graphically visualized keyboard. In our study of full-music indexing of polyphonic music, we are concerned with two tasks: QBM for monophonic queries and query by example (QBE) for polyphonic queries [16].

Relevance judgments are what turn a set of documents and topics into a test collection. Deciding which documents are relevant given a particular query constitutes an important theoretical and practical challenge. Among the more immediate questions it raises are: would performances with the melody transposed or played at a different speed be considered similar? Must performances to be retrieved be in the same genre? Was the melody heard as part of the accompaniment? How does the computer intuitively decide which is the accompaniment? Can the melody be slightly varied? If yes, what is the degree of variation allowed? Must the query be complete in musical structure (theme, motive, phrase, fugal subject, etc.)? Is a query with an arbitrary number of notes possibly not complete musically, analogous to half a word or sentence in text? It would be difficult to define relevant documents if the query is not even considered valid. These are only a few of the issues that need to be addressed in order to define relevance, a notion that takes a central position in IR evaluation.

In developing a list of relevant documents for a query, TREC uses a pooling technique, whereby diverse retrieval systems suggest documents for human evaluation. Documents that are not in the pool, because all systems failed to rank the document high enough, are assumed to be irrelevant to the topic. Human assessors then judge the relevance of the documents in the pool. The quality of the judgements created using the pooling technique should be assessed, in particular with respect to their completeness and the consistency of the relevance judgements. Completeness measures the degree to which all the relevant documents for a topic have been found, while consistency measures the degree to which the human assessor has marked all relevant documents as relevant and the irrelevant documents as irrelevant [5].
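A minimal sketch of pool formation as described, assuming each system contributes one ranked list per topic and the top k documents of every run are merged for human judging; run names and the cut-off are illustrative.

```python
def form_pool(runs, k=100):
    """runs: mapping system_name -> list of document ids, best first.
    Returns the set of document ids to be judged by human assessors."""
    pool = set()
    for ranking in runs.values():
        pool.update(ranking[:k])        # top-k from each contributing system
    return pool

# Example: documents absent from the pool are assumed non-relevant.
runs = {"sysA": ["d3", "d7", "d1"], "sysB": ["d7", "d9", "d2"]}
print(sorted(form_pool(runs, k=2)))     # ['d3', 'd7', 'd9']
```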

With MIR, one of the first steps taken towards standardized music test collections was to list candidate MIR test collections. These collections are useful by themselves for a number of research projects but would be even more useful if they were accompanied by a set of well-defined queries and relevance judgements. The Uitdenbogerd and Zobel collection is a notable exception as it does come with a set of (however incomplete) human relevance judgments [14].

The paper is structured as follows: Section 2 discusses some of the MIR studies that have based the evaluation on relatively large test collections. The discussion includes the relevance judgments and effectiveness measures used. Section 3 discusses the experiments of our MIR study on polyphonic music indexing and retrieval with n-grams using a test collection we developed. Section 4 discusses potential problems of the various test collections and re-emphasizes the need for TREC-like collaboration.

2. BACKGROUND
In this section, we discuss a number of MIR studies for which researchers have developed individual small-scale test collections (using between 3,000 and 10,000 documents) for evaluation [3, 12, 21, 23, 24]. We will focus on issues such as complexity of music data and queries, relevance judgements and effectiveness measures used.

2.1 COLLECTIONS AND QUERIES
There are a number of additional problems that MIR researchers face when dealing with music data for computer-based MIR systems compared to text. Music data can come as simple monophonic sequences, where a single musical note is sounded at any one time, or as polyphonic sequences, where several notes may be sounded at any one time. Music data is multi-dimensional, with musical sounds commonly described in terms of pitch, duration, dynamics and timbre. Music data can be encoded in multiple formats: highly structured, semi-structured or highly unstructured. A few studies have included additional pre-processing modules to deal with these various aspects of music data, for the data collection and/or the queries. The collection and queries used in the studies by Downie [3] and Sødring et al [23] were monophonic. Melody extraction algorithms were used in the studies by Uitdenbogerd [12] and Kosugi et al [21] to preprocess a polyphonic collection; the resulting collection comprised the monophonic sequences obtained from the preprocessing step, along with monophonic queries. The study by Pickens et al [24] used both polyphonic queries and a polyphonic source collection. The collection was encoded in a highly structured format, and a prototype polyphonic audio transcription system was integrated to transcribe polyphonic queries in the audio format.

Collection sizes varied between 3,000 and 10,000 music files. The studies by Downie [3] and Sødring et al [23] used the NZDL [14] collection of about 10,000 folksongs in monophonic format. Uitdenbogerd [12] and Kosugi et al [21] both used around 10,000 MIDI files; the former downloaded music of various genres from the Internet and the latter obtained the collection from a company in Japan. Kosugi et al [21] chose MIDI as the format for their collection as there is a large amount of MIDI available in Japan, where the popularity of karaoke ensures easy access to all the latest pop hits. Most karaoke recordings store the melody data on one MIDI channel, which makes it easy to recognize the melody [21]. The test collection developed by Pickens et al [24] used data provided by CCARH (http://www.ccarh.org). It consists of around 3,000 files of separate movements from polyphonic, fully encoded music scores by a number of classical composers (including Bach, Beethoven, Handel and Mozart). Three additional sets of variations, from which queries were extracted, were added, with the final collection comprising a total of 3,150 documents.

For query acquisition, the approaches taken were either simulated/automatic or manual. Faced with the difficulty of obtaining real-world queries, many researchers simulate queries by extracting excerpts from pieces within the collection and then using error models to generate erroneous queries. One such study, by Downie [3], consists of two phases. In the first phase, 100 songs of a variety of musical styles were selected and queries of lengths 4, 6 and 8 were extracted from the incipits of each song. In the second phase, thirty pieces were randomly selected from the collection and sub-strings of length 11 were taken from various locations in each piece. An error model based on the study by McNab et al [18] was used for error simulation.

Both the automatic and the manual query acquisition approaches were used in the study by Uitdenbogerd [12]. Automatic queries were selected from the collection by assuming that versions of a given piece formed a set of relevant documents. Versions were detected by locating likely pieces of music via the filenames and then verifying by listening to these pieces. One of the pieces from each set of versions was randomly chosen to extract an automatic query. Manual queries were obtained by asking a musician to listen to pieces that were randomly chosen from the set of pieces with multiple versions obtained from the automatic approach, and then to generate a query melody.

In the study by Sødring et al [23], 50 real music fragments were generated manually on a keyboard by a person with music knowledge before being transcribed into Parson's1 notation to form the query set. 258 tunes hummed by 25 people were used as candidate queries in the study by Kosugi et al [21]. Of these, 186 tunes were recognized as melodies available in the database and hence adopted as the query set. For the study by Pickens et al [24], the audio version of one variation from each of the three sets of variations that had been added for the purpose of query extraction was used as the query for the QBE task. They were unable to get human performances of all these variations.

1 An encoding that reflects directions of melodies.


Instead, queries were converted to MIDI and a high-quality piano soundfont was used to create an audio “performance”.

Looking at a few test collections, we can see the diversity in size, genres, formats, complexity and query acquisition approaches that have already been used in MIR studies.

2.2 RELEVANCE JUDGEMENTS
TREC has almost always used binary relevance judgements (a document is relevant to the topic or not). There have been studies investigating the use of multiple relevance levels [8]. The most recent web track used a three-point relevance scale: not relevant, relevant and highly relevant [6]. To overcome the difficulties in obtaining agreement on relevance, TREC uses the pooling technique to obtain a repository of candidate relevant documents and a number of human assessors to judge the relevance of these. In defining relevance, the assessors are told: assume that you are writing a report on the particular topic; if the document would provide helpful information then mark the entire document relevant, otherwise mark it irrelevant. A document is to be judged relevant regardless of the number of other documents that contain the same information [5].

With the need to evaluate MIR systems, a number of relevance definitions have been assumed. With the known item search used in the first phase of Downie’s [3] study, the document from which 100 query sequences were extracted was considered relevant and all remaining documents considered non-relevant. In the second phase, the set of relevant documents for a given query was defined as being the set of those songs in which the query’s progenitor string was found intact.

In the study by Uitdenbogerd [12], “automatic” and manual relevance judgements were used. Versions of a piece were considered to be relevant. Pieces were only considered to have distinct versions if there were obvious differences in the arrangement, such as (i) being in a different key, (ii) using different instruments or (iii) having differences in the rhythm, dynamics, or structure. All arrangements of the piece were assumed to be relevant versions and all other pieces assumed to be irrelevant, thus giving “automatic” relevance judgements. For manual relevance judgements, six judges were asked to listen to the pieces returned by the retrieval system for relevance assessment.

The pooling approach was adopted by Sødring et al [23] to obtain a set of relevant documents. The answer set obtained from submitting the 50 manually generated queries to their MIR system was used as the relevant document set for the various experiments in their study. This approach was seen to be useful in generating a list of relevant documents for a set of queries without the need for human relevance judgements.

Audio queries were transcribed and submitted for retrieval for the study by Pickens et al [24] and Kosugi et al [21]. Being one of the first studies to use an audio polyphonic query, the study by Pickens et al [24] used one variation from each of the three query sets of variations as a query, and any variation within each corresponding set was considered relevant. With the monophonic hummed queries for the study of Kosugi et al [21], relevant documents were identified by song names.

In this section, we have shown that despite the difficulties in defining relevance for MIR, where human music perception has to be addressed, relevance has been defined within the scope of the various MIR studies, based on the need to evaluate the respective systems.

2.3 EVALUATION MEASURES
Many measures of retrieval effectiveness have been defined and a number of these have been adopted for MIR studies. All the measures are based on the notion of relevance. The measures assume that, given a document collection and a query, some documents are relevant to the query and others are not. The objective of an IR system is to retrieve relevant documents and to suppress the retrieval of non-relevant documents [9]. Notable performance measures that were used for the Cranfield tests and continue to be widely used in IR are precision and recall. Recall is the proportion of relevant documents retrieved and precision is the proportion of retrieved documents that are relevant. Formally, given a query q, a set of retrieved documents A(q) and a set of relevant documents R(q), then recall r and precision p are defined as

$$ r = \frac{|A(q) \cap R(q)|}{|R(q)|} \qquad \text{and} \qquad p = \frac{|A(q) \cap R(q)|}{|A(q)|}. $$
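For concreteness, the two measures computed directly from sets of (hypothetical) document identifiers:

```python
def precision_recall(retrieved, relevant):
    """retrieved: A(q), relevant: R(q), both as sets of document ids."""
    hits = len(retrieved & relevant)
    p = hits / len(retrieved) if retrieved else 0.0
    r = hits / len(relevant) if relevant else 0.0
    return p, r

# e.g. 3 of 5 retrieved documents are relevant, out of 4 relevant overall:
p, r = precision_recall({"d1", "d2", "d3", "d4", "d5"}, {"d2", "d3", "d5", "d9"})
print(p, r)   # 0.6 0.75
```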

The official TREC evaluation reports several variants of precision and recall, such as the mean precision at various cut-off levels and a recall-precision graph. The mean average precision is often used as a single summary evaluation statistic [7]. Another measure that has been used is based on the rank of a known-item search. The quality of the retrieval mechanism is judged by the reciprocal rank of the known item; e.g., if the known (and only relevant) item is retrieved at rank 5, a quality of 0.2 would be assigned for this query. By repeating this process with many queries, a mean reciprocal rank (MRR), averaged over the number of queries, is obtained to assess a particular retrieval and indexing method. The MRR measure lies between 0 and 1, where 1 indicates perfect retrieval.
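A small sketch of the reciprocal-rank computation described above (ranks are 1-based; a query whose known item is not retrieved contributes zero):

```python
def mean_reciprocal_rank(rankings, known_items):
    """rankings: one ranked list of document ids per query;
    known_items: the single relevant (known) item for each query."""
    total = 0.0
    for ranking, item in zip(rankings, known_items):
        if item in ranking:
            total += 1.0 / (ranking.index(item) + 1)   # e.g. rank 5 -> 0.2
    return total / len(rankings)

# Known item retrieved at ranks 1 and 5 respectively -> MRR = (1 + 0.2) / 2
print(mean_reciprocal_rank([["a", "b"], ["x", "y", "z", "w", "a"]], ["a", "a"]))  # 0.6
```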

With the complexities of music data [2], other benchmarking measures beyond precision and recall have been proposed. These include evaluating based on aspects of retrieval efficiency (e.g. speed of processing), software quality metrics and human computer interaction (HCI) and user interface (UI) features [13].

For the first phase of Downie’s [3] study, a modified measure of precision was used. Precision was defined as: P = 1/number of song titles retrieved. Queries were extracted from 100 songs. For the purpose of the study, a non-relevant hit was any song title retrieved other than that from which the query was extracted. The second study used normalized precision and recall. The normalized precision (NPREC) and normalized recall (NREC) metrics capture how closely a ranking system performs relative to the ideal by including in the calculation, information about the ranks at which relevant documents are listed. A NPREC or NREC value of 1 indicates that the ideal has been realized while a value of 0 indicates the worst case.

The NPREC and NREC metrics are defined as [25]:

$$ \mathrm{NPREC} = 1 - \frac{\sum_{m=1}^{REL} \log(\mathrm{Rank}_m) - \sum_{m=1}^{REL} \log(m)}{\log\big(N! \,/\, ((N - REL)!\, REL!)\big)} $$

$$ \mathrm{NREC} = 1 - \frac{\sum_{m=1}^{REL} \mathrm{Rank}_m - \sum_{m=1}^{REL} m}{REL\,(N - REL)} $$

where N is the number of documents in the database, REL the number of relevant documents contained in the database, and Rank_m the rank assigned to relevant document m [3, 11].
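A direct transcription of the two formulas, using log-gamma for the log-factorial term to avoid overflow; the `ranks` list of retrieval ranks for the REL relevant documents is hypothetical input:

```python
import math

def nprec_nrec(ranks, n_docs):
    """Normalized precision and recall for one query.
    ranks:  1-based ranks of the relevant documents in the system's ranking
    n_docs: N, the number of documents in the database."""
    rel = len(ranks)
    ideal = range(1, rel + 1)                          # best case: ranks 1..REL
    # log(N! / ((N - REL)! REL!)) computed via log-gamma, since lgamma(x+1) = log(x!).
    log_binom = (math.lgamma(n_docs + 1) - math.lgamma(n_docs - rel + 1)
                 - math.lgamma(rel + 1))
    nprec = 1 - (sum(math.log(r) for r in ranks)
                 - sum(math.log(m) for m in ideal)) / log_binom
    nrec = 1 - (sum(ranks) - sum(ideal)) / (rel * (n_docs - rel))
    return nprec, nrec

# Perfect retrieval (relevant documents at ranks 1, 2, 3) gives (1.0, 1.0):
print(nprec_nrec([1, 2, 3], n_docs=1000))
```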

Other standard measures adopted for MIR studies include: (i) eleven-point precision averages (recall and precision can be averaged at fixed recall levels to compute an overall eleven-point recall-precision average) [12], (ii) precision at k pieces retrieved (the number of relevant melodies amongst the first k retrieved) [12, 23], (iii) precision/recall graphs [23], and (iv) mean average precision and mean precision at the top 5 retrieved documents [24].

For the study by Kosugi et al [21], where relevance was based on the song names, the percentage of songs retrieved within a given rank number formed the basis for their evaluation.

With relevance judgements defined to a certain extent within the scope of each particular study, standard evaluation measures (modified in some cases) have already proved useful for MIR evaluation. Whether or not such measures are useful for large-scale evaluation in a MIR context is a question that needs to be investigated further.

3. TEST COLLECTION
This section describes the series of experiments we performed to investigate the use of n-grams for polyphonic music retrieval. Upon surveying the candidate music test collections [14] for our MIR study, the most closely appropriate one was that developed by Uitdenbogerd and Zobel. However, we had difficulties obtaining this collection: the web site had been discontinued, and attempts to obtain permission to use the downloaded files failed due to copyright problems. We therefore describe our approach towards the development of the test collection needed for our study. In particular, the test collection, i.e. the set of documents, queries, and relevance judgements used in the various experiments, is discussed.

A collection of almost 10,000 polyphonic MIDI performances, mostly of classical music, had been obtained from the Internet [http://www.classicalarchives.com]. These were organized by composer – Bach, Beethoven, Brahms, Byrd, Chopin, Debussy, Handel, Haydn, Liszt, Mendelssohn, Mozart, Scarlatti, Schubert, Schumann and Tchaikovsky. Other composers' works were organised in directories alphabetically (midi-a-e, midi-f-m, etc.). The rest of the collection was categorized as aspire, early, encores and others. A smaller collection of around 1,000 MIDI files of various categories of popular tunes (TV and movie themes, pop, oldies and folksongs) was collected from the Internet (no longer available). Files for which the MIDI-to-text conversion utility issued warnings about the validity of the MIDI file, such as 'no matching offset for a particular onset', were not considered for the test collection. Various subsets of this collection formed the document set for the different experiments. The various tasks and relevance assumptions used in these experiments are discussed below.

3.1 EXPERIMENT 1 – PRELIMINARY INVESTIGATION
A preliminary study on the feasibility of our approach of pattern extraction from polyphonic music data for full-music indexing was performed. N-grams were constructed with a gliding-window approach using all possible patterns of the polyphonic music data [15]. Data analysis was performed to study the frequency distribution of the directions and distances of pitch intervals and of the ratios of onset-time differences that occur within the data set. For query simulation, polyphonic excerpts were extracted from 30 randomly selected musical documents, similar to the study by Downie [3]. The only relevant document for this type of query is the music piece from which the query was extracted, and not variants or otherwise similar pieces. To simulate a variety of query lengths, the excerpts extracted from the randomly selected files were of 10, 30 and 50 onset times. With no error models available for polyphonic music queries, a Gaussian error model was used to generate erroneous queries from the perfect queries extracted.
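A rough sketch of the gliding-window pattern extraction, under stated assumptions: events are (onset time, pitch) pairs, a window spans n consecutive onsets, and every monophonic path through the window (one pitch per onset) yields a pattern of pitch intervals and onset-time ratios. The function name and exact encoding are illustrative, not the indexing scheme actually used in the experiments.

```python
from itertools import product

def polyphonic_ngrams(events, n=3):
    """events: list of (onset_time, pitch) pairs for one piece.
    Yields (interval_pattern, onset_ratio_pattern) tuples for every
    monophonic path through each window of n consecutive onsets."""
    # Group simultaneous notes by onset time.
    onsets = {}
    for t, p in events:
        onsets.setdefault(t, []).append(p)
    times = sorted(onsets)

    for i in range(len(times) - n + 1):
        window = times[i:i + n]
        gaps = [window[k + 1] - window[k] for k in range(n - 1)]
        for path in product(*(onsets[t] for t in window)):   # one pitch per onset
            intervals = tuple(path[k + 1] - path[k] for k in range(n - 1))
            ratios = tuple(round(gaps[k + 1] / gaps[k], 2) for k in range(n - 2))
            yield intervals, ratios

# e.g. a two-note chord followed by two single notes yields two patterns:
notes = [(0.0, 60), (0.0, 64), (0.5, 62), (1.5, 65)]
print(list(polyphonic_ngrams(notes, n=3)))
```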

3.2 EXPERIMENT 2 – QUERY BY MELODY, FAULT-TOLERANCE AND COMPARATIVE STUDY
The second experiment was performed to test the feasibility of querying a polyphonic music collection with a monophonic sequence [17]. Monophonic queries are thought to resemble query by melody. In particular, we focused on QBH systems, and the fault-tolerance of the n-gram approach was examined based on QBH error models. In order to simulate ad-hoc queries, where the collection is kept constant but the information need changes, we hand-crafted ten monophonic queries. These were popular tunes of various genres. The songs and the number of relevant documents for each are listed in Table 1.

For pieces from the classical collection (Song IDs 1, 4, 6, 7, 8 and 10), one performance of each tune was identified using the filename and composer directory. Each of these performance files was edited using a MIDI sequencer, jazz-4.1.3, to extract a polyphonic excerpt containing the theme. Pitches for the themes were referenced against the Dictionary of Musical Themes [20]. Using the retrieval approach and the optimal parameters identified in the first experiment, these polyphonic excerpts containing the theme were used as queries to form a relevant document pool. MIR studies that include the pooling approach to relevant document acquisition include those by Uitdenbogerd [12] and Sødring et al [23]. The documents retrieved were listened to and their relevance judged based on assumptions similar to Uitdenbogerd [12] (see Section 2). From each polyphonic query excerpt, the theme was then extracted manually as a monophonic sequence, using MIDI-to-text and text-to-MIDI conversion utilities.

For the popular music pieces (Song IDs 3 and 9), only one version of each was available in the collection. These were used to extract monophonic queries using the MIDI sequencer; further versions were obtained by performing a search on the Internet using the same relevance assumption. Lastly, Happy Birthday (Song ID 2), assumed to be a tune that everybody knows, was added. This was not available in our collection at all; therefore as many versions as possible, based on the relevance assumption defined above, were searched for on the Internet, and one of the versions with basic chords accompanying the tune was selected to extract the monophonic query. Query lengths varied between 15 and 25 notes for eight of the songs. The query for Beethoven's Symphony No. 5 had just 8 notes and that for Hallelujah was the most elaborate with 285 notes.

Table 1. Song list

Song ID  Song Title                              No. relevant
   1     Alla Turca (Mozart)                          5
   2     Happy Birthday                               4
   3     Chariots of Fire                             3
   4     Etude No. 3 (Chopin)                         1
   5     Eine kleine Nachtmusik (Mozart)              5
   6     Symphony No. 5 in C Minor (Beethoven)        8
   7     The WTC, Fugue 1, Bk 1 (Bach)                2
   8     Für Elise (Beethoven)                        3
   9     Country Gardens                              2
  10     Hallelujah (Händel)                          7

This test collection enabled us to perform further investigation on the use of n-grams towards full-music indexing of polyphonic music and a comparative study of various n-gramming strategies. QBH error models were surveyed and used to investigate the fault-tolerance of the indexing approach. For performance evaluation, we used the precision-at-15 measure, in which the performance of a system is measured by the number of relevant melodies amongst the first k retrieved, with k=15 in our case.

3.3 EXPERIMENT 3 – ROBUSTNESS AND ENVELOPES
This study performed a more extensive investigation of the robustness of the QBM and QBE tasks. The study also proposed and evaluated an approach to reduce the number of musical words generated with the n-gram approach to full-music indexing of polyphonic music [16]. Two types of queries were extracted from the music pieces: monophonic queries, where only one note per onset time was extracted, and polyphonic queries, where a polyphonic subsequence of events was extracted from the music piece. The approach of extracting the highest pitch from the several possible pitches at each onset was the best of the several melody extraction algorithms investigated by Uitdenbogerd [12] and was therefore used to extract the monophonic queries as query melodies. The MRR measure was used as the evaluation metric for this experiment.
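A minimal sketch of the highest-pitch melody-extraction rule mentioned above (one note kept per onset time), using the same (onset, pitch) event representation assumed earlier:

```python
def highest_pitch_melody(events):
    """events: list of (onset_time, pitch) pairs from a polyphonic piece.
    Returns a monophonic sequence keeping only the highest pitch per onset."""
    best = {}
    for t, p in events:
        best[t] = max(p, best.get(t, p))
    return [best[t] for t in sorted(best)]

print(highest_pitch_melody([(0.0, 60), (0.0, 64), (0.5, 62), (1.5, 65)]))  # [64, 62, 65]
```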

3.4 EXPERIMENT 4 – PROXIMITY ANALYSIS
This section describes the test collection development for our on-going work on proximity analysis. Term position information in indexes is known to improve retrieval performance. An approach to obtain polyphonic term positions of "overlaying" musical words, based on the nature of polyphonic music data, is introduced. Query formulation using proximity operators and a ranking function that addresses the adjacency and concurrency of musical words generated from polyphonic data, for better retrieval precision, are being investigated.
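One way such positional information might be stored is sketched below purely as an illustration: each musical word is posted together with the onset index at which its window starts, so concurrency (same position) and adjacency (consecutive positions) of query words can be tested at ranking time. This is not the actual index layout used in the experiments.

```python
from collections import defaultdict

def build_positional_index(docs):
    """docs: mapping doc_id -> list of (onset_index, musical_word) pairs.
    Returns word -> doc_id -> list of onset indices (term positions)."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, postings in docs.items():
        for pos, word in postings:
            index[word][doc_id].append(pos)
    return index

def concurrent(index, w1, w2, doc_id):
    """True if the two words occur at the same onset position (overlaid)."""
    return bool(set(index[w1][doc_id]) & set(index[w2][doc_id]))

def adjacent(index, w1, w2, doc_id):
    """True if w2 starts at the onset position immediately after w1."""
    p2 = set(index[w2][doc_id])
    return any(p + 1 in p2 for p in index[w1][doc_id])
```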

In order to evaluate this approach, we have extended the query set to 50 queries (40 added to the 10 used in the third experiment). These were thought to be sufficient, based on TREC's practice of using 50 topics from which queries are generated. The additional queries were chosen from amongst the 'pop' of classical music. This query list can easily be extended to a more comprehensive list, possibly including real-world queries such as collections of queries from music libraries, music stores, etc. [2]. We think the Dictionary of Musical Themes [20] is a useful repository for query acquisition. The query list, which comprises works of various composers and periods of music within the scope of tonal music, is deemed sufficiently comprehensive for the QBM task. A separate task may need to be defined for contemporary music, with its own set of specific problems, such as the difficulty of defining a melody in this class of music [19].

Exhaustive judging, where relevance is determined for each document, is feasible for a collection of this size. The relevance assumption of versions used by Uitdenbogerd [12] was adopted and the first few seconds of each performance were listened to for a judgement to be made. This seemed a reasonable approach, as the author was sufficiently familiar with the query melodies to be able to recognize a performance within the first few seconds.

4. TREC-LIKE COLLABORATION
It is clear from the MIR studies discussed in Section 2 and our work in Section 3 that the use of test collections has enabled MIR researchers to evaluate their work. Standardized test collections have been a useful benchmark for text retrieval evaluation for around 40 years now [5]. Manual relevance judgements, the pooling approach, known-item searches, relevance by song titles/filenames and exhaustive judging have all been used with MIR test collections. All of these approaches, however, have potential problems, such as the completeness and consistency of the relevance judgements. These problems have been addressed with the large TREC collections and will need to be addressed when moving away from small-scale test collections towards large-scale evaluation.

With so much research evaluated using test collections, the stability of the test collection is an important issue and has been addressed by TREC. Relative effectiveness of two retrieval strategies should be insensitive to slight changes in the relevant document set in order to reflect the true merit of the retrieval strategy being evaluated. The reasons as given by [4] for stability of system rankings despite differences in relevance judgements have been further discussed by [7]. The reasons for stability are as follows:

1. Evaluation results are reported as averages over many topics

2. Disagreements among judges affect borderline documents, which in general are ranked after documents that are unanimously agreed upon


3. Recall and precision depend on the relative position of the relevant and non-relevant documents in the relevance ranking, and changes in the composition of the judgement sets may have only a small effect on the ordering as a whole

It has been argued that the third reason may not apply to the large collection sizes of TREC, where there could be hundreds of relevant documents. It has been shown, however, that the first and second reasons appear to hold for the TREC collections [7]. In the context of MIR, it is unclear to what extent the second reason will prove valid, for it could be argued that human music perception would generate much larger differences in relevance judgments. This question clearly needs to be investigated further.

The investigation into the stability of system rankings with different sets of relevance assessments was carried out by NIST with the following tests [7]:

1. The use of the overlap of the relevant document sets to quantify the amount of agreement among different sets of relevance assessments [4]. Overlap is defined as the size of intersection of the relevant document sets divided by the size of the union of the relevant document sets

2. As a different view of how well assessors agree with one another, one set of judgements, say set Y, can be evaluated with respect to another set of judgements, set X. Assume the documents judged relevant in set Y are the retrieved set; then the recall and precision of that retrieved set using the judgements in X can be calculated.

3. The correlation can be quantified by using a measure of association between the different system rankings. A correlation based on Kendall's τ as the measure of association between two rankings can be used. Kendall's τ computes the distance between two rankings as the minimum number of pair-wise adjacent swaps needed to turn one ranking into the other. The distance is normalized by the number of items being ranked such that two identical rankings produce a correlation of 1.0, the correlation between a ranking and its perfect inverse is -1.0, and the expected correlation of two rankings chosen at random is 0.0 (see the sketch after this list).
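Minimal sketches of two of the quantities above: the overlap of two relevant-document sets, and Kendall's τ between two system rankings, computed here by brute force over item pairs (both rankings are assumed to order the same items, with no ties):

```python
from itertools import combinations

def overlap(rel_a, rel_b):
    """|intersection| / |union| of two relevant-document sets."""
    return len(rel_a & rel_b) / len(rel_a | rel_b)

def kendall_tau(rank_a, rank_b):
    """rank_a, rank_b: lists of the same items in two different orders.
    Returns +1.0 for identical rankings, -1.0 for exact inverses."""
    pos_b = {item: i for i, item in enumerate(rank_b)}
    concordant = discordant = 0
    for x, y in combinations(rank_a, 2):        # x precedes y in rank_a
        if pos_b[x] < pos_b[y]:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)

print(kendall_tau(["s1", "s2", "s3"], ["s3", "s2", "s1"]))   # -1.0
```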

When looking at completeness, it is also necessary to assess the degree of selection bias² that occurs. Relevance judgments need to be unbiased: it does not matter how many or how few judgments are made, but the documents that are judged should not be correlated with the documents retrieved by any particular retrieval method. Having complete judgments ensures that there is no selection bias, but pooling with sufficiently diverse pools has been shown to be a good approximation [5].

TREC-like collaboration is clearly needed to conduct the extensive tests required to address stability issues of MIR test collections and to obtain sufficiently diverse pools. Such tasks are formidably difficult for individual researchers to accomplish at a smaller scale.

² Selection bias occurs when the subjects studied are not representative of the target population about which conclusions are to be drawn.

5. CONCLUSION
With no standardized test collection available for MIR evaluation, we have shown how individual researchers have developed state-of-the-art test collections for the purpose of metric scientific evaluation. However, there are a number of potential problems that are generic to these various small-scale test collections, and we believe that these can be overcome through TREC-like collaboration. Collaboration could alleviate problems in the following areas:

1. Resources (Collection and queries)

The collection sizes used are a fraction of those of real-world music repositories [2], and much larger collection sizes are needed. The difficulties in query acquisition may be overcome by obtaining real-world queries, such as those available from music libraries or radio stations, or by scanning or encoding documented themes. Extensive studies to obtain real-world error models of music queries would be one approach towards generating a large music query repository. A collaborative effort would be needed to identify potentially relevant documents (via pooling), and more resources to assess the relevance of the pooled documents. Extensive testing of the stability of the system rankings is required. A test bed created in this manner and made available would further the whole research field of music IR.

2. Relevance

Perhaps within the QBM task, relevance assumptions such as those already used in the studies thus far could be expanded. A comprehensive representation of user classes may be needed to reach a consensus on relevance based on a task. Various tasks need to be identified and relevance defined accordingly. Relevance based on a scale is one possibility, although human musical perception of similarity is notoriously difficult to model.

3. Copyright

TREC mainly uses old newspaper articles, which have no real commercial value, and hence distribution is not problematic. Music pieces have a value, and the test bed would need to be protected, either through strict licenses specifying the legal use of the data, or perhaps by storing the test collection in a centralized secure environment that offers computing services, such as accepting uploaded search engine indexing and retrieval code, executing it at the repository, and returning the ranked lists. Alternatively, one could distribute preprocessed "features" of the music pieces that are commercially not relevant (rather than the original audio): a MIDI-like representation, the volume footprint, a rhythm or pitch extract, etc.; or the test collection could consist of "half" music pieces, e.g. the first 60 seconds of each 120-second segment of the music piece.

4. Generic modules

Collaboration does not have to end with the creation of a test collection. Given the complexities of music data, a collaborative effort to develop modules that create features, extract melodies and preprocess data, tasks that require expertise from various disciplines, would certainly ease MIR research. For example, groups with an expertise in IR of symbolic representations might well benefit from preprocessing that translates raw audio into symbolic forms such as MIDI.


6. ACKNOWLEDGMENTS This work is partially supported by the EPSRC, UK.

7. REFERENCES

[1] J. Stephen Downie, Panel on Music Information Retrieval Evaluation Frameworks, 3rd International Conference on Music Information Retrieval, ISMIR 2002, Paris, France, pp. 303-304.

[2] J. Stephen Downie, Music Information Retrieval, Annual Review of Information Science and Technology, 37: 295-340.

[3] J. Stephen Downie, Evaluating a Simple Approach to Music Information Retrieval: Conceiving Melodic N-Grams as Text, PhD Thesis, University of Western Ontario, 1999.

[4] M. Lesk and G. Salton, Relevance Assessments and Retrieval System Evaluation, Information Storage and Retrieval, 4: 343-359, 1969.

[5] Ellen M. Voorhees, Whither Music IR Evaluation Infrastructure: Lessons to be Learned from TREC, Panel, JCDL 2002 Workshop on MIR Evaluation.

[6] Ellen M. Voorhees, Evaluation by Highly Relevant Documents, SIGIR '01, pp. 74-81.

[7] Ellen M. Voorhees, Variations in Relevance Judgements and the Measurement of Retrieval Effectiveness, Information Processing and Management, 36: 697-716, 2000.

[8] Amanda Spink and Howard Greisdorf, Regions and Levels: Measuring and Mapping Users' Relevance Judgements, Journal of the American Society for Information Science and Technology, 52(2): 161-173, 2001.

[9] Stephen P. Harter and Carol A. Hert, Evaluation of Information Retrieval Systems: Approaches, Issues, and Methods, Annual Review of Information Science and Technology, Volume 32, 1997.

[10] Cyril Cleverdon, The Significance of the Cranfield Tests on Index Languages, SIGIR '98.

[11] C. J. van Rijsbergen, Information Retrieval, online book.

[12] Alexandra Uitdenbogerd, Music Information Retrieval Technology, PhD Thesis, Royal Melbourne Institute of Technology, 2002.

[13] The MIR/MDL Evaluation Project White Paper Collection, Edition #2, http://www.music-ir.org

[14] Don Byrd, Candidate Music Test Collections, background document for ISMIR 2000 on Music Information Retrieval Evaluation, The First International Symposium on Music Information Retrieval, ISMIR 2000, Plymouth, Massachusetts, USA, 23-25 October 2000.

[15] Shyamala Doraisamy and Stefan Rüger, An Approach Towards a Polyphonic Music Retrieval System, 2nd International Symposium on Music Information Retrieval, ISMIR 2001, Indiana, USA, pp. 187-193.

[16] Shyamala Doraisamy and Stefan Rüger, Robust Polyphonic Music Retrieval with N-Grams, Journal of Intelligent Information Systems, 21(1): 53-70, 2003.

[17] Shyamala Doraisamy and Stefan Rüger, A Comparative and Fault-Tolerance Study of the Use of N-Grams with Polyphonic Music, 3rd International Conference on Music Information Retrieval, ISMIR 2002, Paris, France, pp. 101-106.

[18] Rodger J. McNab, Lloyd A. Smith, Ian H. Witten, Clare L. Henderson and Sally Jo Cunningham, Towards the Digital Music Library: Tune Retrieval from Acoustic Input, DL '96, Bethesda, MD, USA.

[19] Alain Bonardi, IR for Contemporary Music: What the Musicologist Needs, ISMIR 2000.

[20] Harold Barlow and Sam Morgenstern, A Dictionary of Musical Themes, London: Ernest Benn, 1949.

[21] Naoko Kosugi, Yuichi Nishihara, Tetsuo Sakata, Masashi Yamamuro and Kazuhiko Kushima, A Practical Query-By-Humming for a Large Music Database, ACM Multimedia 2000, Los Angeles, CA, November 2000.

[22] L. Hirschman, Language Understanding Evaluations: Lessons Learned from MUC and ATIS, Proceedings of the First International Conference on Language Resources and Evaluation (LREC), pp. 117-122, Granada, Spain, May 1998.

[23] Thomas Sødring and Alan F. Smeaton, Evaluating a Music Information Retrieval System – TREC Style, Panel Discussion, 3rd International Conference on Music Information Retrieval, ISMIR 2002, Paris, France.

[24] Jeremy Pickens, Juan Pablo Bello, Giuliano Monti, Tim Crawford, Matthew Dovey, Mark Sandler and Don Byrd, Polyphonic Score Retrieval Using Polyphonic Audio Queries: A Harmonic Modeling Approach, 3rd International Conference on Music Information Retrieval, ISMIR 2002, Paris, France, pp. 140-149.

[25] Gerard Salton and Michael J. McGill, Introduction to Modern Information Retrieval, New York: McGraw-Hill, 1983.

[26] Cyril Cleverdon, Evaluation Tests of Information Retrieval Systems, Journal of Documentation, Vol. 26, No. 1, March 1970, pp. 55-67.


If It Sounds As Good As It Looks: Lessons Learned From Video Retrieval Evaluation

Abby A. Goodrum School of Information Studies

Syracuse University Syracuse, NY 13244-4100

+1 (315) 443-5602 [email protected]

ABSTRACT
In many ways, music information retrieval (MIR) bears a closer resemblance to video information retrieval (VIR) than it does to text retrieval. Both music and video provide rich, complex sources of information having their own semantics. Both share the challenges of digitizing, segmenting and streaming, joined by problems relating to the representation of non-textual, non-verbal information. Because of this complexity, systems for the retrieval of these media pose unique challenges to evaluation including the construction of large testbeds, the crafting of representative topics for searching, and identification of appropriate metrics for evaluation. This paper will discuss recent efforts in video retrieval evaluation and how these efforts might inform the creation of an experimental evaluation environment for Music Information Retrieval.

1. INTRODUCTION

Evaluation is the process of examining an entity (subject, process, system, technique, etc.) and assessing or appraising it based on its important features. We determine how much or how little we value an entity, arriving at our judgment on the basis of criteria that we can define and measure. As such, evaluation is a critical component in the progress of science and technology.


The field of IR has a long tradition of evaluation used to compare the relative performance of different system designs. For the past 12 years, the National Institute of Standards and Technology (NIST), part of the US Department of Commerce, has been running a series of conferences on text retrieval (TREC). TREC has gathered together large collections of text, spoken audio, web and video information with a view to supporting research into information retrieval (IR). TREC provides an infrastructure and mechanisms for the comparative evaluation of IR systems that have greatly increased our understanding of IR over the last decade. This is no "vulgar pick the winners" approach [12], but is instead an attempt to uncover the best solutions to support the information needs of users.

All evaluative endeavors require the identification of suitable criteria for evaluation, measures and instruments for measuring the criteria, and methodologies for conducting evaluation experiments. Several scholars have pointed to problematic issues within IR evaluation [4,7,8,13,14], and I will not attempt a complete recitation of those issues here. I wish to focus, instead, on those evaluation issues which are shared by video information retrieval and music information retrieval. I do not claim that the issues discussed here are not pertinent for text information retrieval researchers, but the focus will be on how these challenges relate specifically to video and music IR. Specifically, I will discuss the unique methodological problems of constructing large media test collections, crafting statements of information need, and selecting appropriate mechanisms for analysis of results.


As a first step, it is important to examine the unique characteristics of music and video information sources that distinguish them from textual information sources.

2. SHARED CHARACTERISTICS

Digitized music and video are rich and complex sources of information. While they may convey information similar to that contained in texts, they do not communicate in the same fashion as texts. Moreover, both music and video impart information sequentially over time at varying levels of complexity.

For example, video is composed of approximately 25-30 individual images or frames per second. Each frame contains information relating to both low-level and high-level features, including shapes, textures, colors, brightness, and the position or location of complex objects such as people, places, and things. An unbroken sequential string of frames taken from the same camera defines a shot, and shots are joined by transitions such as fades, hard cuts, dissolves, and wipes. Successful shot boundary detection paired with feature extraction is a prelude to scene detection, wherein an object of interest is visible across multiple shots that do not necessarily occur contiguously. Correctly identifying high-level video structures is a difficult task; hence, automatic detection of fundamental units, such as shots, from the stream is vitally important.

Similarly, music information contains both high- and low-level features such as notes, dynamics, intervals, key, loudness, timbre, melody, rhythm, pitch, voice, instruments, timing, noise, tonality, themes, etc. This is complex enough when discussing monophonic music (one note occurring at a time), but the level of complexity increases substantially when considering polyphonic music that may comprise multiple voices and instruments sounding simultaneously. Identifying appropriate mechanisms for feature extraction and segmentation is crucial here, as in video.

In addition to having complex data structures, digitized music and video have file structures and file sizes that impact compression, storage and retrieval. Uncompressed music files are large, take up memory, and are slow to search and download. Compression makes them easier to transmit and store, but results in some loss, typically of inaudible or redundant data. For example, one minute of CD-quality music in mp3 format is roughly equivalent to 1 MB. To determine the file size of one second of uncompressed video, multiply the image size by the number of frames per second (fps). For example, one second of uncompressed, full-size, full-speed (30 fps), 24-bit video is: 900K x 30 = 27 MB, or 2.7 MB using a compression ratio of 10:1.
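The arithmetic above, spelled out; the 900K frame size and the 10:1 ratio are the figures used in the text, and 1 MB is taken as 1000 KB to match them:

```python
def video_size_mb(frame_kb=900, fps=30, seconds=1, compression=1):
    """Size of a video clip: frame size x frames per second x duration,
    optionally divided by a compression ratio."""
    return frame_kb * fps * seconds / 1000 / compression

print(video_size_mb())                  # 27.0 MB per second, uncompressed
print(video_size_mb(compression=10))    # 2.7 MB with 10:1 compression
```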

The challenges of digitizing and segmenting are joined by problems relating to the representation of non-textual, non-verbal information. Finding, for example, all instances of a certain pitch, harmony, rhythm, shape, texture, or visual object challenges IR systems built essentially to match words in a query to words in a collection. The problem is essentially one of representational congruity.

Representation is a central concept in information retrieval, and occurs at several stages in the process. On one level, authors/composers/filmmakers represent their ideas and knowledge as documents such as articles, books, scores, and films. Similarly, users represent complex goals, problems and knowledge gaps as information needs. At another level, representations of documents are matched against representations of the users' information needs as expressed in queries. Successful retrieval is predicated on the extent to which representations actually share in the nature of the thing being represented. There are three areas that affect the outcome of the IR system with respect to representation:

- The extent to which document representations share congruence with the documents for which they stand.
- The extent to which queries share congruence with the information needs for which they stand.
- The extent to which queries and document representations share congruence with each other.

Representations of documents function not only as attributes against which a query may be matched, but also provide support for browsing, navigation, relevance judgments, and query reformulation. It is important to note, however, that the representation of a user's information need as expressed in a query is largely driven by system parameters for query construction and for the representation of retrieved documents. Only in the last decade have users been able to query by humming or by submitting an image exemplar. Thus, recent advances in the technology for signal processing and pattern matching have been driving decisions about MIR and VIR system design, rather than an understanding of user needs and user behaviors in these retrieval environments. This is not to suggest that we embrace an either-or scenario for video and music IR research. Both system- and user-centered approaches are needed at this early stage.

Both video and music are multidimensional and require multiple features in order to represent a 'document,' but there is sparse research to drive our understanding of which features are most useful for searching, sorting, ranking, navigating and relevance judgments. A great deal of research has demonstrated that relevance is a complex, multi-faceted relationship among information needs, information tasks, information environments, documents, document attributes, search processes and search interfaces [9]. Increasingly we are coming to understand that the criteria for relevance must be defined within a specific task context before evaluation can occur. For video and music IR we lack a taxonomy of tasks and a taxonomy of queries related to these tasks. For appropriate evaluation design, we must also identify the various features that support relevance judgments for those tasks.

Relevance is also central to the two measures used in most IR evaluation studies: recall and precision. Recall is the proportion of relevant documents that are retrieved, and precision is the proportion of retrieved documents that are relevant (a minimal sketch of both measures follows the list below). In VIR and MIR these measures raise a number of questions, including (but not limited to):

• What constitutes a document for the purposes of computing precision and recall?

• What portion of the document constitutes an 'answer'?

• Is the entire video or score relevant to an information need, or only that segment containing the feature of interest?

• How do we assess the performance and contribution of different types of features?
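For readers less familiar with these two measures, a minimal sketch follows. The shot-level identifiers are purely illustrative assumptions; the point of the questions above is precisely that the unit of retrieval is not yet settled for video or music.

```python
def precision_recall(retrieved, relevant):
    """Compute precision and recall over any unit of retrieval
    (whole documents, shots, or musical segments)."""
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical example: shot identifiers retrieved for one topic.
retrieved = {"shot_03", "shot_07", "shot_12", "shot_19"}
relevant = {"shot_03", "shot_12", "shot_44"}
p, r = precision_recall(retrieved, relevant)
print(f"precision={p:.2f}, recall={r:.2f}")  # precision=0.50, recall=0.67
```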

Given the sparsity of empirical understanding in these areas, it would seem daunting to attempt any large scale comparative evaluation across systems. Nevertheless, in an effort to move VIR research forward, the VIR community began planning for and developing what later became the video track at TREC. A brief overview of the evolution of this endeavor may provide a framework for discussion of ways that the MIR community can prepare to move in a similar direction.

3. TREC VIDEO TRACK

For many years, VIR researchers constructed their own testbeds of digitized video to support their projects. In many cases these proprietary collections contained copyrighted material, or material lacking documented intellectual property rights. For example, the Informedia Project at Carnegie Mellon built a digital video library of over a terabyte of data from CNN, The Discovery Channel, and public television [16]. Each VIR project also created a variety of surrogates and indexing schemata to support its particular research goals.

By the late 1990s the video information retrieval community was beginning to call for a large-scale open test environment for evaluating the various content-based retrieval systems that had emerged from research labs around the world. One of the central challenges was the creation of a large collection of freely available, public domain digitized video covering a diverse range of topics and representative of real retrieval environments.

At the Twenty-Second Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, in Berkeley, California, the National Institute of Standards and Technology (NIST) publicized the release of the first installment of a public domain digital video test collection on DVD [10]. At about the same time, the Open Video Project at the University of North Carolina also made available a large collection of digitized video for use by the VIR community [5,11].

Starting in 2001, TREC sponsored the first track devoted to research in automatic segmentation, indexing, and content-based retrieval of digital video. Now in its third year, the TREC video track has aimed to promote progress in content-based retrieval from digital video via open, metrics-based evaluation [15]. Beginning in 2003, the track will become an independent two-day evaluation workshop taking place each year just before TREC.

From the start, the intent of the TREC video track was to allow the methods and procedures for evaluation to evolve from year to year, based on participant feedback. In two short years, the video track participants have made substantial progress in creating mechanisms for the evaluation of video retrieval systems. The amount of digitized video used for the test collection has grown from 11 hours of video data in 2001 to approximately 73 hours of publicly available VCD/MPEG-1 material from NIST, the Internet Archive and the Open Video Project. The subject material in the test collection has also been expanded and diversified, and includes both black-and-white and color films produced between the 1930s and 1970s. The subject matter now ranges across topics taken from educational, industrial, amateur, public service, and marketing films. The content of the topic collection and the process for determining relevance have also evolved substantially thanks to the combined efforts of the video track participants and other members of the video retrieval community. What follows is a brief overview of the evolution of the TREC video track over the past two years.

Tasks

The tasks have changed since the first year. In 2001 the tasks were shot boundary detection, known-item search, and general search. Both search tasks could be conducted in either automatic or interactive mode, and systems were allowed to use transcripts created by automatic speech recognition. Participants in the 2002 video track took part in three tasks: shot boundary detection, feature extraction (new in 2002), and searching. The searching task was modified after 2001 to exclude fully automatic topic-to-query translation, and known-item searching was discontinued.

Shot Boundary Detection

Although single keyframes have been used successfully as surrogates for video retrieval and relevance judgments, an important challenge in video processing for many years has been the ability to automatically discern different types of shots. A shot is defined as an unbroken sequence of frames taken from one camera and includes shot transitions such as fades, hard cuts, dissolves, and wipes. In the camera, a shot is what is recorded between the time the camera starts rolling and the time it stops rolling. In collections that use shot-level description, such as stockshot libraries or television archives, there may be an initial edit of the raw material, for example to discard technically imperfect or unusable parts of a shot. As a result, the shot found in the information system may be different from the one that originally came from the camera.

Data for the shot boundary task in 2001 comprised 5.8 hours (3.34 gigabytes) of MPEG-1 encoded video. In 2002 this was cut to 4 hours and 51 minutes of MPEG-1 encoded video. All transitions were identified and classified beforehand by NIST. In the course of running the shot boundary task in 2001, participants discovered that different MPEG-1 decoders were producing varying frame numbering from the same source video files. Workaround solutions were proposed by participants, and modifications were made to the protocols used for comparing submissions against reference data.

Shot detection is also an important prelude to the evaluation of search performance. Search runs typically point to 'answers' found within multiple shots, which must be well defined and robust in order to support comparison against the search runs of competing systems. Problems arose in 2001 because the shot boundaries for the search tasks were defined by the participants themselves, making comparison across systems difficult. For this reason, predetermined shot definitions were used by all groups in 2002 for the evaluation of the feature detection and search tasks.

Feature Extraction

The ability to automatically identify the presence of high-level features such as "People", "Indoor/Outdoor", "Text Overlay", and "Instrumental Music" represents a significant achievement in video retrieval. Additionally, these features may serve as a basis for enhanced search capabilities.

The semantic feature extraction task was introduced in 2002 and will be continued in 2003. The objectives of this task were to begin the benchmarking process for evaluating the effectiveness of detection methods on different features, and to make the features extracted in this task available to participants in the search task as part of their queries. During on-line discussions by track participants, a simple set of features was chosen that had the greatest potential given current system abilities. These features were suggested by participants before the topics for search were known.

Topics

Ideally, topics should be taken from real users interacting with the same collection used in the TREC video track. This was not an option, however, so statements of information need (topics) in the first year of the TREC video track were created by participants and by NIST. NIST was also responsible for making some revisions and eliminating duplicate topics. Topics were subdivided into known-item and general searches. All topics were pooled and all systems were expected to run on all topics. Each topic statement included a textual description of the information needed and one or more media exemplars (audio, video, still image). Relevance for both known-item search topics and general search topics was assessed by NIST.

For the TREC video track in 2002, 25 topics were created by NIST to represent the needs of a trained user seeking material for reuse in a large video archive. Known-item topics were discontinued. As before, the statements included multimedia exemplars. An as yet unsolved problem remains: the topic statements are created from observation of the collection and may be biased toward the text or audio accompanying some of the videos.

Search

Research examining human video retrieval interaction is still in its infancy, and documentation of how users formulate and modify queries using multiple media is scant. Moreover, a taxonomy of actual query types across diverse disciplines, collections, user characteristics, etc. has yet to be developed.

From the start, the TREC video track community recognized the need to incorporate human cognition into the search task. Searching is subdivided into two approaches: a 'manual' approach, whereby a human searcher formulates a single query optimized for a specific search system based on the topic description, and an 'interactive' approach, in which a human searcher generates an initial query and then refines it based on initial search output. In addition to calculating mean average precision, the interactive runs also measure total elapsed time for each search.
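Mean average precision is the standard TREC measure referred to here; a compact sketch of its computation is given below. The topic and shot identifiers are hypothetical placeholders, and the sketch assumes binary relevance judgments.

```python
def average_precision(ranked, relevant):
    """Average of the precision values at each rank where a relevant item appears,
    divided by the total number of relevant items for the topic."""
    hits, precisions = 0, []
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(relevant) if relevant else 0.0

def mean_average_precision(runs, judgments):
    """Mean of average precision over all topics in a run."""
    return sum(average_precision(runs[t], judgments[t]) for t in runs) / len(runs)

# Hypothetical two-topic run over shot identifiers.
runs = {"T1": ["s3", "s9", "s1"], "T2": ["s2", "s8", "s5"]}
judgments = {"T1": {"s3", "s1"}, "T2": {"s5"}}
print(round(mean_average_precision(runs, judgments), 3))  # 0.583
```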

Although the first TREC video track in 2001 provided for the evaluation of fully automated search runs, this has been set aside for the time being and will be revisited in a future workshop.

Systems may be developed with knowledge of the test collection beforehand, and groups may also take advantage of the features donated by groups participating in the feature extraction task. The diversity of searcher and search interface interactions, as well as the variability across topics, makes comparison across systems somewhat difficult; a goal for 2003 is to explore methods to improve this state of affairs.

Given that the TREC video track is still evolving and defining the methods, data, parameters, criteria, measures and procedures for conducting comparative evaluation across VIR systems, what can the MIR community learn from their experiences?

4. TREC MUSIC TRACK?

A diverse range of systems has so far been developed for music information retrieval, including systems that retrieve from MIDI representations, monophonic transcriptions, scores, thematic catalogs, and raw audio files [2,6]. A preliminary question to ask is whether a sufficient number of systems with comparable approaches exist. Furthermore, would the creators of these systems be interested in participating in an evaluative study?

At the same time, research examining music information behaviors is sparse [1]. A second preliminary question to ask is whether enough is known about music information seekers/users to begin to create topic statements and relevance assessments. Task definition is vital: who are your users and what do they need to do? Knowing this makes a difference in the type of collection you create, the types of queries you support, and the nature of relevance judgments. Given affirmative answers to the above and drawing upon the experiences of the TREC video track participants, what should the MIR community do next?

The first effort should be the creation of a large, easily accessible collection of music. This testbed collection for MIR should reflect the diversity of real collections. It should contain recordings of various lengths, from a range of genres, artists, and instruments. There should be multiple recordings of the same composition as well as variations on themes. The recordings should be supported by additional resources such as commentary, critique, indexing, scores, annotations, and other metadata describing the files. The collection must be free of intellectual property usage restrictions.

Next, the MIR community must develop a test collection of complex and diverse queries – including humming by real people, real queries, known item queries, general pattern matching and genre queries, etc. If possible, these queries should be generated by real users interacting with real collections and their relevance judgments should be captured along with their search processes.

Finally, don’t try to do it all in the first year! The video track in TREC is still evolving and will continue to do so for many years to come. Similarly, the complex issues surrounding music information retrieval evaluation will not be solved or settled in a single workshop. That should not stop the MIR community from moving forward – with caution and an eye toward iterative improvement and gradual understanding of the complexities of evaluation in multimedia environments.

5. CONCLUSIONS

Content-based video IR and music IR are in their infancy. We have only recently moved from a bibliographic paradigm rooted in text retrieval to the development of retrieval systems based on visual and audio feature matching. Furthermore, we are still a long way from developing systems that are capable of human-like understanding of audio and video. An important step towards this understanding will be to expand our knowledge of user interactions with these media and to incorporate this knowledge into the design and evaluation of systems for retrieval.

While a TREC music information retrieval track is a good idea, we should remember that this is not the only way (and possibly not the best way) to evaluate MIR systems. Other approaches, including case studies of MIR system use in multiple organizational settings and qualitative studies of MIR system users and of music information-seeking behavior outside of specific systems, will also yield useful insights. In addition to precision and recall, other metrics should be explored, such as those relating to browsing, navigation, and music understanding. Similarly, evaluation should be focused at different levels of feature complexity.

Finally, although we certainly wish to conduct comparative evaluation on the newer content-based technologies for music information retrieval, we should not abandon older text-based approaches entirely. A great deal of music is still cataloged and indexed in this fashion and will be for many years to come. Moreover, inclusion of bibliographic approaches allows for wider participation in MIR system research and development by a more interdisciplinary MIR community [3].

6. REFERENCES

[1] Cunningham, S.J. (2002). User studies: A first step in designing an MIR testbed. Papers Presented at the Workshop on the Creation of Standardized Test Collections, Tasks, and Metrics for Music Information Retrieval (MIR) and Music Digital Library (MDL) Evaluation, 18 July, 2002, pp. 19-21.

[2] Downie, J.S. (2000). Access to music information: The state of the art. Bulletin of the American Society for Information Science & Technology, 26(5), June/July 2000.

[3] Futrelle, J. & Downie, J.S. (2002). Interdisciplinary communities and research issues in music information retrieval. Proceedings of the Third International Conference on Music Information Retrieval (ISMIR 2002), Paris, France, October 13-17, 2002, pp. 215-221.

[4] Ingwersen, P. (1992). Information retrieval interaction. London: Taylor Graham.

[5] Open Video Project. <http://openvideo.dsi.internet2.edu/>

[6] Reiss, J.D. & Sandler, M.D. (2002). Benchmarking music information retrieval systems. JCDL Workshop on the Creation of Standardized Test Collections, Tasks and Metrics for Music Information Retrieval (MIR) and Music Digital Library (MDL) Evaluation, Portland, Oregon, 2002.

[7] Robertson, S.E. & Hancock-Beaulieu, M.M. (1992). On the evaluation of IR systems. Information Processing & Management, 28(4), 457-466.

[8] Saracevic, T. (1995). Evaluation of evaluation in information retrieval. Proceedings of ACM SIGIR '95, Seattle, WA, pp. 138-146.

[9] Schamber, L. (1994). Relevance and information behavior. In Williams, M. (Ed.), Annual Review of Information Science & Technology, vol. 29 (pp. 3-48). Medford, NJ: Learned Information.

[10] Schmidt, C. & Over, P. (1999, August). Digital video test collection. Proceedings of the Twenty-Second Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Berkeley, California, USA.

[11] Slaughter, L., Marchionini, G., & Geisler, G. (2000). Open video: A framework for a test collection. Journal of Network and Computer Applications, 23.

[12] Sparck Jones, K. (1995). Reflections on TREC. Information Processing & Management, 31(3), 291-314.

[13] Sparck Jones, K. (Ed.) (1981). Information Retrieval Experiment. London: Butterworths.

[14] Tague-Sutcliffe, J.M. (1992). The pragmatics of information retrieval experimentation, revisited. Information Processing & Management, 28(4), 467-490.

[15] TRECVID website: <http://www-nlpir.nist.gov/projects/trecvid/>

[16] Wactlar, H., Hauptmann, A., Gong, Y., & Christel, M. (1999). Lessons learned from the creation and deployment of a terabyte digital video library. IEEE Computer, 32(2), 66-73.


Comparison of User Ratings of Music in Copyright-free Databases and On-the-market CDs

Keiichiro Hoashi
KDDI R&D Laboratories, Inc.
2-1-15 Ohara, Kamifukuoka, Saitama 356-8502, Japan
[email protected]

Kazunori Matsumoto
KDDI R&D Laboratories, Inc.
2-1-15 Ohara, Kamifukuoka, Saitama 356-8502, Japan
[email protected]

Naomi Inoue
KDDI R&D Laboratories, Inc.
2-1-15 Ohara, Kamifukuoka, Saitama 356-8502, Japan
[email protected]

ABSTRACT
We have been conducting research to develop a music information retrieval system which retrieves music based on user preferences. In order to conduct evaluation experiments, it is necessary to accumulate an experimental set of music data and collect user ratings of each piece of music included in this data set. Two music data sets were prepared for user evaluation: the RWC Popular Music Database, which is a collection of 100 copyright-free pop songs, and a collection of songs recorded on on-the-market CDs. Comparison of user ratings of the songs in each data set shows that user ratings of the RWC songs were far lower than those of the songs in the other music data set. Based on this experience, the authors believe that the construction of a standard list of on-the-market CDs is necessary for the development of an entertaining MIR system.

General Terms
Experimentation.

1. INTRODUCTION
The main task of conventional music retrieval systems is to retrieve a piece of music data which matches the request of a user. The importance of such systems is expected to increase due to the rapid spread of digital music data formats such as MP3. However, conventional systems can only be used to search for a particular song in a database.

We have been conducting research regarding a music information retrieval (MIR) method based on the music preferences of users [1][2]. Such a system will enable users to discover good songs which they have never heard before from a large music database.

User preferences for music songs are necessary to evaluate the effectiveness of the proposed algorithms. Therefore, we collected user ratings for songs included in two music data sets: one is a collection of copyright-free pop songs, and the other is a collection of songs from CDs on the market. A significant difference was observed between the user ratings of the songs in the two data sets.

Based on the results of the user rating data collection experiments, we make a proposal to the MIR community in this paper to build a standard collection of on-the-market CDs which can be used for MIR-related research purposes. In the next section, a brief explanation of our MIR algorithm is presented. This explanation is followed by a description of the music data sets used for the user rating data collection experiments, and an explanation of the experiments. Our proposal follows the experiment descriptions.

2. MUSIC INFORMATION RETRIEVAL BASED ON USER PREFERENCES

In this section, we give a brief explanation of our research, which aims at the development of an MIR algorithm based on user preferences.

2.1 Tree-based vector quantization
Our MIR algorithm is based on the tree-structured vector quantization method (TreeQ) developed by Foote [3]. The approach of the TreeQ method is to "train" a vector quantizer instead of modelling the sound data directly. Figure 1 illustrates the concept of the TreeQ method.

As illustrated in Figure 1, each audio datum in the training data set, which is a collection of audio data associated with a class such as artist or genre, is first parameterized into a spectral representation by calculating mel-frequency cepstral coefficients (MFCC) [4]. More specifically, each audio waveform, sampled at 44.1 kHz, is transformed into a sequence of 13-dimensional feature vectors (12 MFCC coefficients plus energy).
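As a rough illustration of this parameterization step, the sketch below produces a 13-dimensional feature sequence (12 MFCCs plus a log-energy term) using the librosa library. The frame parameters and the use of log RMS energy in place of the 0th cepstral coefficient are assumptions, since the paper does not specify these details.

```python
import librosa
import numpy as np

def parameterize(path):
    """Return a (num_frames, 13) array: 12 MFCCs plus a log-energy term per frame."""
    y, sr = librosa.load(path, sr=44100)                 # waveform at 44.1 kHz
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # shape (13, num_frames)
    energy = librosa.feature.rms(y=y)                    # shape (1, num_frames)
    # Replace the 0th cepstral coefficient with log energy (a common convention,
    # assumed here rather than stated in the paper).
    features = np.vstack([np.log(energy + 1e-10), mfcc[1:]])
    return features.T
```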

Once all training data has been parameterized by the calculation of MFCCs, a quantization tree is generated offline based on these MFCCs and the category data labeled to each item in the training data set. The resulting tree is optimized so that it attempts to put samples from different training classes into different bins (leaves) as much as possible. A histogram of an audio file can be generated by looking at the relative frequencies of samples in each quantization bin. The relative frequency can be considered as the probability of a data sample ending up in a certain leaf. For example, if 20 of 100 samples input into the quantization tree are classified in leaf i, the relative frequency of leaf i is 0.20. If the resulting histograms are considered as vectors, typical vector similarity measures such as the cosine measure can be applied to calculate the similarity between any incoming audio data and category vectors.


[Figure 1: Outline of the tree-structured vector quantization method. Waveform → MFCC computation → MFCCs → quantize via tree → accumulate histograms.]

This method has been used for music and audio retrieval experiments. In Reference [3], experiments were conducted to retrieve short audio data such as oboe sounds and human laughter. The same paper also reports experiments to retrieve music data of a specified musical genre, such as jazz, pop, or rock.

Our method is to associate user preferences as the category information of the training music data set. For example, N "good" songs and M "bad" songs are input to the tree generation process to build the VQ tree. Next, user profiles, i.e., vectors which represent "good" and "bad" songs, are generated by inputting all "good" or "bad" songs through the VQ tree. Vectors of all other music data are also generated by inputting the data through the VQ tree. Scores for each piece of music are calculated by measuring the vector similarity between each piece and the user profiles.
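A compact sketch of this scoring pipeline is given below. The trained quantization tree is treated as a black box (the `quantize` argument is a hypothetical function mapping one feature frame to a leaf index), and the way the two profiles are combined into a single score is an illustrative choice of ours, not a detail fixed by the paper.

```python
import numpy as np

def histogram(frames, quantize, n_leaves):
    """Leaf-frequency histogram of one song: relative frequency of each VQ-tree leaf."""
    counts = np.zeros(n_leaves)
    for frame in frames:                      # frames: iterable of MFCC feature vectors
        counts[quantize(frame)] += 1
    return counts / counts.sum()

def profile(songs, quantize, n_leaves):
    """Category vector ("good" or "bad") built by passing all songs of one class
    through the VQ tree and accumulating a single histogram."""
    counts = np.zeros(n_leaves)
    for frames in songs:
        for frame in frames:
            counts[quantize(frame)] += 1
    return counts / counts.sum()

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def score(song_frames, good_profile, bad_profile, quantize, n_leaves):
    """Score an unseen song. Subtracting the two similarities is one plausible way
    to combine them; the paper only states that scores come from vector similarity."""
    h = histogram(song_frames, quantize, n_leaves)
    return cosine(h, good_profile) - cosine(h, bad_profile)
```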

2.2 Relevance feedback
Due to the complexity and ambiguity of music and users' musical preferences, and the limited amount of learning data, it is not realistic to expect the previously described music retrieval methods to achieve satisfactory performance consistently for all users. Therefore, there may be situations where users are unsatisfied with the system's retrieval results, and feel the need to provide additional information regarding their preferences. The most efficient way to implement this is to collect relevance feedback from the user, and update the user profile based on the collected relevance feedback information, which is a widely used method in text IR.

As described in Section 2.1, each category C is expressed by a vector $\vec{C} = (c_1, \ldots, c_n)$, where $n$ is the number of histogram bins, and $c_i$ is the relative frequency of samples in bin $i$. Since $c_i$ is considered as the probability of a training data sample being categorized in bin $i$, $c_i$ can be calculated by the following formula:

$$c_i = \frac{|c_i|}{\sum_{i=1}^{n} |c_i|} \qquad (1)$$

where $|c_i|$ expresses the number of data samples classified to bin $i$ in the learning phase.

[Figure 2: Outline of the relevance feedback method. Relevance feedback data → quantize via tree → updated histogram.]

Relevance feedback is implemented by adding "relevant" data to each category vector. For example, if a song retrieved by the pilot search is preferred by the user, this song can be considered as relevant to category $C_g$. If the relevant song $K_g$ is expressed as $\vec{K}_g = (k_{g1}, \ldots, k_{gn})$, and the updated category vector is expressed as $\vec{C}'_g = (c'_{g1}, \ldots, c'_{gn})$, relevance feedback is implemented by the following formula:

$$c'_{gi} = \frac{|c_{gi}| + |k_{gi}|}{\sum_{i=1}^{n} \left( |c_{gi}| + |k_{gi}| \right)} \qquad (2)$$

In other words, the updated category vector is obtained by accumulating the relevant data set on the original category histogram. This concept is illustrated in Figure 2.

Songs which the user dislikes can also be used to update category $C_g$, as in the $\gamma$ factor of Rocchio's algorithm [5], where information derived from non-relevant documents is applied to relevance feedback. The following formula defines the implementation of a non-relevant song $K_b$ to update category vector $\vec{C}_g$:

$$c'_{gi} = \frac{|c_{gi}| + |k_{gi}| - |k_{bi}|}{\sum_{i=1}^{n} \left( |c_{gi}| + |k_{gi}| - |k_{bi}| \right)} \qquad (3)$$

As in Rocchio's algorithm, information from non-relevant data is simply subtracted from the original vector. The resulting bin value is set to 0 if the bin value resulting from Formula 3 is negative. A similar approach can be implemented to update category vector $\vec{C}_b$, by considering "bad" songs as relevant and "good" songs as non-relevant. This method is expressed by the following formula:

$$c'_{bi} = \frac{|c_{bi}| + |k_{bi}| - |k_{gi}|}{\sum_{i=1}^{n} \left( |c_{bi}| + |k_{bi}| - |k_{gi}| \right)} \qquad (4)$$

The updated category vectors are then used to recalculate the scores of songs in the test data set. Details of the evaluation experiments of our methods are to be published in Reference [2].
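A direct transcription of formulas (2)-(4) is sketched below. It works on the raw (unnormalized) leaf counts $|c_i|$ and $|k_i|$, and it assumes, as one reasonable reading of the text, that negative bins are clamped to zero before renormalization.

```python
import numpy as np

def update_category(c_counts, relevant_counts, nonrelevant_counts=None):
    """Update a category histogram with feedback counts (formulas 2-4).

    c_counts           -- raw leaf counts |c_i| of the category being updated
    relevant_counts    -- leaf counts |k_i| of a song judged relevant to it
    nonrelevant_counts -- leaf counts of a song judged non-relevant (formula 3/4);
                          leave as None for plain positive feedback (formula 2)
    """
    updated = c_counts + relevant_counts
    if nonrelevant_counts is not None:
        updated = updated - nonrelevant_counts
    updated = np.maximum(updated, 0)        # negative bins are set to 0 (see text)
    return updated / updated.sum()          # renormalize to relative frequencies

# Example: a "good" song reinforces C_g while a "bad" song is subtracted (formula 3).
# Formula 4 is the symmetric case, with the roles of "good" and "bad" swapped.
c_g = np.array([20.0, 50.0, 30.0])
k_good = np.array([10.0, 5.0, 5.0])
k_bad = np.array([0.0, 40.0, 0.0])
print(update_category(c_g, k_good, k_bad))
```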


3. MUSIC DATA SETS
A collection of user ratings for songs is necessary to evaluate the previously described MIR algorithms. We have constructed a user rating data collection based on two music data sets: the RWC Music Database: Popular Music, and a collection of songs recorded on CDs on the market. The following sections describe the two music data sets.

3.1 RWC Music Database: Popular Music
The "RWC Music Database: Popular Music" is part of a music data collection available to music researchers [6] (distribution of the RWC Music Database outside of Japan has recently begun). The RWC Popular Music Database consists of 100 songs. All songs were composed and recorded for the purpose of inclusion in the database. Furthermore, in order to represent various styles of music, the developers of the database engaged as many professional composers, lyric writers, arrangers, singers, instruments, etc. as possible.

3.2 HMV data collection
The other music data collection was constructed based on weekly ranking data of CD albums at HMV Japan (http://www.hmv.co.jp/; contents in Japanese only), a major Japanese online music store. HMV Japan provides an archive of weekly CD sales rankings on its Web site, which was used to accumulate the list of CDs used for our experiments.

First, all CD albums which were ranked in the weekly top 10 in the year 2001 were accumulated. We divided this CD collection into two subsets: CDs ranked from January to June 2001, and from July to December 2001. Furthermore, all "best" albums and "compilation" albums (i.e., collections of hit songs by various artists) were omitted from the list, in order to reduce the number of highly popular songs. Next, all songs contained in the resulting CD collections were extracted to construct the experiment data set. As a result, a total of 756 songs were derived from the Jan-Jun data set, which consists of 60 CDs, and 812 songs were derived from the Jul-Dec data set, which consists of 61 CDs.
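The construction procedure can be summarized by a small filtering sketch. The `weekly_top10` structure and its album fields are hypothetical, since the actual HMV ranking data is not distributed with the paper.

```python
# Hypothetical weekly ranking records: (week, rank, album_title, album_type, track_list)
weekly_top10 = [
    ("2001-01-08", 1, "Album A", "regular", ["song 1", "song 2"]),
    ("2001-07-16", 3, "Greatest Hits B", "best", ["hit 1", "hit 2"]),
    # ...
]

def build_subset(records, start, end):
    """Collect songs from regular albums ranked in the weekly top 10 of one half-year."""
    songs, seen_albums = [], set()
    for week, rank, title, album_type, tracks in records:
        if not (start <= week <= end):
            continue
        if album_type in ("best", "compilation"):   # drop "best" and compilation albums
            continue
        if title not in seen_albums:                 # each album contributes its songs once
            seen_albums.add(title)
            songs.extend(tracks)
    return songs

jan_jun = build_subset(weekly_top10, "2001-01-01", "2001-06-30")
jul_dec = build_subset(weekly_top10, "2001-07-01", "2001-12-31")
```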

4. USER RATING COLLECTION EXPERIMENT

4.1 Method
In order to collect user rating data, 12 subjects applied subjective ratings ranging from 1 to 5 (Bad: 1 ∼ Good: 5) to all songs in each experiment data set. Songs of the RWC Popular Music Database were rated first, followed by the two HMV data sets. Songs were presented to each subject in random order within each data set. Furthermore, metadata such as song titles and artist names was removed so that the subjects could not rate songs without actually listening to them. However, the subjects were allowed to fast-forward (or rewind) through each song to listen to any portion of the song.

4.2 Results
The ratio of all user ratings for each data set is shown in Table 1.

Table 1: Ratio of user ratings per music data set

Rating   RWC     HMV (Jan-Jun)   HMV (Jul-Dec)
5         7.3%   14.0%           13.8%
4        20.7%   26.4%           25.7%
3        25.7%   30.9%           30.1%
2        28.8%   21.5%           21.4%
1        17.5%    7.2%            9.0%

[Figure 3: Ratio of user ratings for RWC data — per-subject (User IDs 101-112) distribution of ratings 1-5.]

Results in Table 1 clearly show that user ratings are generally higher for the HMV music data, compared to the RWC data. If songs with ratings of (1 or 2), 3, and (4 or 5) are regarded as "bad", "fair", and "good" songs, respectively, the average ratio of "good" songs in the HMV data set is approximately 40%, while the same ratio for the RWC data set is 28%. Furthermore, the ratio of "bad" songs in the HMV set is just below 30%, while the "bad" song ratio for the RWC set is about 44%.

For further analysis of user ratings, the ratio of ratings in the RWC and HMV data sets for each user is illustrated in Figures 3 and 4, respectively.

Figures 3 and 4 show the wide variety of ratings between subjects. However, these results also show that not one subject has provided higher ratings to the RWC data compared to the HMV data. Furthermore, there were 4 subjects who rated less than 10% of all RWC songs as "good" songs.

Rating results for the RWC data set made it difficult to conduct MIR experiments based on the RWC data, since our algorithm requires learning data to generate a VQ tree, and relevance feedback data to improve precision of retrieval. It is obvious that the number of relevant (i.e., "good") songs is insufficient for the 4 "fussy" subjects illustrated in Figure 3. Therefore, we were forced to run our experiments based on the HMV data set.

5. PROPOSAL
In order to achieve further advancement in research on MIR based on user preferences, it is obvious that a larger-scale experiment is necessary. However, the manual construction of a large-scale music data set is not only time-consuming, but also a very difficult task, if we are to make an objectively fair archive of music. We do believe that our method of extracting a list of CDs based on weekly sales ranking data is a rather fair way to generate a music data set. However, the accumulated CD list can be considered a biased data set, since consumers at CD shops may be biased towards young people, who prefer new music over traditional types of music.


[Figure 4: Ratio of user ratings for HMV data — per-subject (User IDs 101-112) distribution of ratings 1-5.]

While copyright-free music data collections such as the RWC Database are definitely fair, our user rating experiments show that typical users simply do not like such music. Therefore, we do not expect that the expansion of copyright-free music data (which is, of course, beneficial for conventional MIR research) will contribute to the development of user-preference-based MIR systems.

Based on our experiences, our proposal to the MIR community is to build a large list of on-the-market CDs which can be used as a standard set of music for MIR research. The number of CDs in this standard data set should be in the order of thousands, in order to provide extensive data for various research goals.

Generation of a music data set of this scale will certainly require cooperation from the music industry. Unfortunately, due to the recent situation regarding the illegal distribution of copyrighted music on the Net, many major record companies are becoming more and more restrictive about the use of their content. We hope the MIR community will push the music industry to provide an abundant data set for the development of MIR research.

6. CONCLUSION
In this paper, we presented the results of user rating collection experiments based on a copyright-free music data set, and a music data set derived from the weekly sales ranking data of a major online CD shop. Results of our experiments clearly show that typical users apply higher ratings to songs extracted from on-the-market CDs, compared to songs in copyright-free music data sets. Based on this experience, we make a proposal to the MIR community to make a standard list of on-the-market CDs. We expect such a standard data set to provide a major contribution not only for MIR based on user preferences, but also for other potential MIR-related research issues.

7. REFERENCES
[1] Hoashi, Zeitler, Inoue: "Implementation of relevance feedback for content-based music retrieval based on user preferences", Proceedings of ACM SIGIR 2002, pp. 385-386, 2002.
[2] Hoashi, Matsumoto, Inoue: "Personalization of user profiles for content-based music retrieval based on relevance feedback", to be published in Proceedings of ACM Multimedia 2003, 2003.
[3] Foote: "Content-based retrieval of music and audio", Proceedings of SPIE, Vol. 3229, pp. 138-147, 1997.
[4] Davis, Mermelstein: "Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences", IEEE Trans. Acoustics, Speech, Signal Proc., ASSP-28(4), 1980.
[5] Rocchio: "Relevance feedback in information retrieval", in "The SMART Retrieval System - Experiments in Automatic Document Processing", Prentice Hall Inc., pp. 313-323, 1971.
[6] Goto, Hashiguchi, Nishimura, Oka: "RWC Music Database: Popular, classical and jazz music databases", Proceedings of ISMIR 2002, pp. 287-288, 2002.


Query by Humming: How good can it get?

Bryan Pardo

EECS Dept, University of Michigan 153 ATL, 1101 Beal Avenue Ann Arbor, MI 48109-2110

+1 (734) 369-3207

[email protected]

William P. Birmingham
EECS Dept, University of Michigan

110 ATL, 1101 Beal Avenue Ann Arbor, MI 48109-2110

+1 (734) 936-1590

[email protected]

ABSTRACT
When explaining the Query-by-Humming (QBH) task, it is typical to describe it in terms of a musical question posed to a human expert, such as a music-store clerk. An evaluation of human performance on the task can shed light on how well one can reasonably expect an automated QBH system to perform. This paper describes a simple example experiment comparing three QBH systems to three human listeners. The systems compared depend on either a dynamic-programming implementation of probabilistic string matching or hidden Markov models. While results are preliminary, they indicate that existing string matching and Markov model performance does not currently achieve human-level performance.

1. INTRODUCTION
Our research group is interested in Query-by-Humming (QBH) systems that allow users to pose queries by singing or humming them. QBH systems search musical content. This is in contrast to approaches to music retrieval based on searching metadata, such as song title, genre, and so forth. The Music Information Retrieval (MIR) community has, of late, been rightly concerned with finding meaningful empirical measures of the quality of an MIR system that searches musical content. Those interested in such systems have focused on building large databases of monophonic songs. Experimental queries are often synthetic, and are generated from elements of the database [1-5]. Unfortunately, differences in database composition, representation, query transcription methods, ranking methods, and methodology in the evaluation of results make comparison between systems difficult, if not impossible. Further, we are unaware of any direct comparisons between automated MIR system performance and human performance.

When describing the QBH task, it is typical to describe it in terms of a musical question posed to a human expert, such as a music-store clerk. An evaluation of human performance on the task can shed light on how well one can reasonably expect an automated QBH system to perform. How well, then, could one expect a human to perform? One can compare human and algorithmic performance if the database is limited to a set of pieces known to the human and the task is limited to that of "name that tune," rather than ranking the full database. This paper describes a simple example experiment to compare three QBH systems against three human listeners. While the results are preliminary, the experiment shows how to establish a human performance baseline for QBH systems.

2. THE SEQUENCE MATCHERS USED
A string is any sequence of characters drawn from an alphabet, such as a sequence of notes in a written musical score, or notes transcribed from a sung query. String matchers find the best alignment between string Q and string T by finding the lowest-cost (or, equivalently, highest-reward) transformation of Q into T in terms of operations (matching or skipping characters). The score of the best alignment can be used as a measure of the similarity of two strings. Dynamic-programming implementations that search for a good alignment of two strings have been used for over 30 years to align gene sequences based on a common ancestor [6], and have also been used in musical score following [7, 8, 9] and query matching [10, 4]. For this experiment, we chose to use the probabilistic string matcher described in [11], using both the global alignment algorithm and the local alignment algorithm as described in [12].

Hidden Markov models, or HMMs, have been used relatively infrequently in the MIR literature [2, 13] to do melody recognition, but seem promising. We chose to represent targets using the HMM architecture described in Shifrin et al. [13]. The Forward algorithm [14] measures the similarity of a target, represented as an HMM, to a query string by computing the probability that the target generated the query.

Given a query, Q, and a set of targets, {T1...Tn}, an order may be imposed on the set of targets by running the same scoring algorithm (global, local, or Forward) between Q and each target, Ti, and then ordering the set by the value returned, placing higher values before lower. We take this rank order to be a direct measure of the relative similarity between a theme and a query. The ith target in the ordered set is then the ith most similar to the query. Thus, the first target is the one most similar to the query, according to the given scoring algorithm.
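As a concrete illustration of the string-matching side, the sketch below scores a query against each target with a simple global alignment in the Needleman-Wunsch style and then ranks targets by that score. The match/mismatch/gap values are arbitrary placeholders rather than the probabilistic reward model used in the paper, and the pitch-interval strings are hypothetical.

```python
def global_alignment_score(q, t, match=1.0, mismatch=-1.0, gap=-1.0):
    """Best global alignment score between two symbol sequences (dynamic programming)."""
    n, m = len(q), len(t)
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = dp[i - 1][0] + gap
    for j in range(1, m + 1):
        dp[0][j] = dp[0][j - 1] + gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(
                dp[i - 1][j - 1] + (match if q[i - 1] == t[j - 1] else mismatch),
                dp[i - 1][j] + gap,      # skip a query symbol
                dp[i][j - 1] + gap,      # skip a target symbol
            )
    return dp[n][m]

def rank_targets(query, targets):
    """Order targets by decreasing alignment score, as described in the text."""
    scored = [(global_alignment_score(query, t), name) for name, t in targets.items()]
    return [name for _, name in sorted(scored, reverse=True)]

# Hypothetical pitch-interval strings, keyed by piece name.
targets = {"Hey Jude": [0, 2, -2, -1], "Penny Lane": [2, 2, -4, 1]}
print(rank_targets([0, 2, -2], targets))
```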

3. EXPERIMENTAL SETUP
Figure 1 outlines the experimental setup used to compare our systems to the three human listeners. As can be seen from the figure, humans had the advantage in that they listened directly to recorded queries, rather than a sonification (such as a MIDI performance) of the transcribed queries. This was because, for this experiment, we were interested in the maximal performance that could be achieved by a human, without introducing error from pitch tracking, pitch quantization, and note segmentation. We note that the algorithms had to deal with errors introduced by these processing steps, which can be substantial.



The remainder of this section describes our simple experiment to compare the performance of our complete string matcher and HMM systems against human performance.

Figure 1. Experimental Setup

3.1 Transcribed Target Database
We used a corpus of 260 pieces of music encoded as MIDI from public domain sites on the Internet. The corpus is composed entirely of pieces that have been recorded by the Beatles. This includes all pieces recorded on albums for U.S. release and a number of "covers" they performed that were originally composed by other artists, such as "Roll Over Beethoven." For a full list of pieces in the database, please consult the MusEn website at http://musen.engin.umich.edu/. We selected music performed by the Beatles because their music tends to be well known, the salient information to identify pieces tends to be melodic and relatively easy to sing, and the pieces are readily accessible, both as audio and as MIDI.

Each piece in the corpus was represented in the database by a set of themes, or representative monophonic melodic fragments. The number of distinct "catchy hooks" determined the number of themes chosen to represent each piece. Of the pieces, 238 were represented by a single theme, 20 by two themes, and two pieces were represented by three themes, resulting in a database of 284 monophonic themes. These themes constitute the set of targets in the database. A sequence of <pitch-interval, InterOnsetInterval-ratio> pairs (Pardo, 2002) was created and stored in the database for each MIDI theme. Themes were quantized to 25 pitch intervals and five log-spaced InterOnsetInterval-ratio intervals. Each theme was indexed by the piece from which it was derived. An HMM for each theme was then generated automatically from the theme sequence and placed in the database.
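The target representation just described can be sketched as follows. The exact 25 pitch-interval bins and five log-spaced IOI-ratio bins are not given here, so the quantization boundaries below are illustrative assumptions only.

```python
import math

def to_interval_pairs(notes):
    """Convert (midi_pitch, onset_seconds) triples of consecutive notes into quantized
    <pitch-interval, IOI-ratio> pairs. Assumes strictly increasing onset times."""
    pairs = []
    for (p0, t0), (p1, t1), (p2, t2) in zip(notes, notes[1:], notes[2:]):
        interval = max(-12, min(12, p1 - p0))                     # clamp to 25 semitone bins
        ioi_ratio = (t2 - t1) / (t1 - t0)                         # ratio of successive inter-onset intervals
        ratio_bin = max(-2, min(2, round(math.log2(ioi_ratio))))  # five log-spaced bins
        pairs.append((interval, ratio_bin))
    return pairs

# Hypothetical theme: (MIDI pitch, onset time in seconds)
theme = [(60, 0.0), (62, 0.5), (64, 1.0), (62, 2.0)]
print(to_interval_pairs(theme))  # [(2, 0), (2, 1)]
```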

3.2 Query Corpus
A query is a monophonic melody sung by a single person. Singers were asked to select one syllable, such as "ta" or "la", and use it consistently for the duration of a single query. The consistent use of a single consonant-vowel pairing was intended to minimize pitch-tracker error by providing a clear starting point for each note, as well as reducing error caused by diphthongs and vocalic variation.

Three male singers generated queries for the experiment. Singer 1 was a twenty-two year old male with no musical training beyond private instrumental lessons as a child. Singer 2 was a twenty-seven year old male with a graduate degree in cello performance. Singer 3 was a thirty-five year old male with a graduate degree in saxophone performance. None are trained vocalists. All are North American native speakers of English.

[Figure 1 (Experimental Setup) box diagram: Sung Query → Query Transcriber → Transcribed Query → Forward Algorithm / String Alignment against the Target Database → Best Match; the Human Listener hears the Sung Query directly and also produces a Best Match.]

Sung queries were recorded in 8-bit, 22.5 kHz mono using an Audio-Technica AT822 microphone from a distance of roughly six inches. Recordings were made directly to an IBM ThinkPad T21 laptop using its built-in audio recording hardware and were stored as uncompressed PCM .wav files. Each singer was allowed a trial recording to get a feel for the process, where the recorded melody was played back to the singer. This trial was not used in the experimental data. Subsequent recordings were not played back to the singer.

Once the trial recording was finished, each singer was presented with a list containing the title of each of the 260 Beatles recordings in our database. Each singer was then asked to sing a portion of every song on the list that he could. Singer 1 sang 28 songs; Singer 2 sang 17; and Singer 3 sang 28 songs. The result was a corpus of 73 queries covering forty-one of the Beatles' songs, or roughly 1/6 of the songs in our database. Singers 1 and 3 sang 24 songs in common. Singer 2 had 8 songs in common with the other two. These 8 songs were common to all 3 singers. These songs are 'A Hard Day's Night,' 'All You Need Is Love,' 'Here Comes The Sun,' 'Hey Jude,' 'Lucy In The Sky With Diamonds,' 'Ob-La-Di, Ob-La-Da,' 'Penny Lane,' and 'Sgt. Pepper's Lonely Hearts Club Band.'

These queries were then automatically pitch tracked, segmented, and quantized to 25 pitch intervals and five IOI-ratio intervals. This resulted in 73 query strings, which were used as the query set for all experiments. Mean query length was 17.8 intervals. The median length was 16. The longest query had 49 intervals, and the shortest had only two. The median number of unique elements per query sequence was nine.

3.3 Experimental Results
We define the recognition rate for a system to be the percentage of queries where the correct target was chosen as the top pick. The highest recognition rate achieved by any of our systems was 71%, by the local string matcher on Singer 3. The lowest rate was 21%, by the Forward algorithm on Singer 1.

We created a baseline by presenting the sung queries to the singers who generated the query set, to see how many of them would be recognized. Two months after the queries were made, the three singers were gathered into a room and presented with the original recordings of the queries in a random order. Each recording was presented once, in its entirety. Singers were told that each one was a sung excerpt of a piece of music performed by the Beatles and that the task was to write down the name of the Beatles song the person in the recording was trying to sing. Only one answer was allowed per song, and singers were given a few (no more than 15) seconds after each query to write down an answer. Once all queries had been heard, responses were graded. Recall that queries were sung with nonsense syllables and that lyrics were not used. Because of this, we judged any response that contained a portion of the correct title or a quote of the lyrics of the song as a correct answer. All other answers were considered wrong.

Table 1 contains the results of the human trials, along with the results for the automated QBH systems. As with the human trials, the automated algorithms were judged correct if the right answer was ranked first and incorrect otherwise. Each column represents the results for a query set. Each row in the table contains the recognition rates achieved by a particular listener or QBH system. The row labeled "Other 2 Singers" contains the average recognition rates of the two singers who did NOT sing a particular set of queries. Thus, for Singer 2's queries, the "Other 2 Singers" value is the average of how well Singer 1 and Singer 3 recognized Singer 2's queries.

Table 1. Human Performance vs. Machine Performance

                          Singer 1   Singer 2   Singer 3   Mean
Singer 1                  96%        71%        79%        82%
Singer 2                  50%        82%        46%        59%
Singer 3                  71%        76%        89%        79%
Other 2 Singers           61%        74%        63%        66%
String Matcher (Global)   29%        24%        39%        31%
String Matcher (Local)    36%        41%        71%        49%
HMM (Forward)             21%        35%        68%        41%
N                         28         17         28

It is interesting to note that the human listeners achieved an average recognition rate of 66% when presented with queries sung by another person. This figure was lower than expected and may provide a rough estimate of how well one can expect a machine system to do. Even more interesting was the inability of Singers 2 and 3, both with graduate degrees in music performance, to achieve even a 90% recognition rate on their own queries, while Singer 1 achieved a much higher recognition rate on his own queries.

4. CONCLUSIONS
This paper describes an experiment comparing three QBH systems to three human listeners. The systems compared depend on either a dynamic-programming implementation of probabilistic string matching or hidden Markov models. While results are preliminary, they indicate that existing string matching and Markov model performance does not currently achieve human-level performance. Future work in this project includes collecting more queries and listeners, and having listeners attempt to recognize pieces from audio generated from query transcriptions.

5. ACKNOWLEDGMENTS
We gratefully acknowledge the support of the National Science Foundation under grant IIS-0085945, and The University of Michigan College of Engineering seed grant to the MusEn project. The opinions in this paper are solely those of the authors and do not necessarily reflect the opinions of the funding agencies.

6. REFERENCES
[1] McNab, R., et al. Towards the Digital Music Library: Tune Retrieval from Acoustic Input. In First ACM International Conference on Digital Libraries. 1996. Bethesda, MD.
[2] Meek, C. and W.P. Birmingham. Johnny Can't Sing: A Comprehensive Error Model for Sung Music Queries. In ISMIR 2002. 2002. Paris, France.
[3] Pickens, J. A Comparison of Language Modeling and Probabilistic Text Information Retrieval. In International Symposium on Music Information Retrieval. 2000. Plymouth, Massachusetts.
[4] Uitdenbogerd, A. and J. Zobel. Melodic Matching Techniques for Large Music Databases. In the Seventh ACM International Conference on Multimedia. 1999. Orlando, FL.
[5] Downie, S. and M. Nelson. Evaluation of a Simple and Effective Music Information Retrieval Method. In the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. 2000. Athens, Greece.
[6] Needleman, S.B. and C.D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 1970. 48: pp. 443-453.
[7] Dannenberg, R. An On-Line Algorithm for Real-Time Accompaniment. In International Computer Music Conference. 1984: International Computer Music Association.
[8] Puckette, M. and C. Lippe. Score Following In Practice. In International Computer Music Conference. 1992: International Computer Music Association.
[9] Pardo, B. and W.P. Birmingham. Following a musical performance from a partially specified score. In Multimedia Technology Applications Conference. 2001. Irvine, CA.
[10] Hu, N., R. Dannenberg, and A. Lewis. A Probabilistic Model of Melodic Similarity. In International Computer Music Conference (ICMC). 2002. Goteborg, Sweden: The International Computer Music Association.
[11] Pardo, B. and W. Birmingham. Improved Score Following for Acoustic Performances. In International Computer Music Conference (ICMC). 2002. Goteborg, Sweden: The International Computer Music Association.
[12] Durbin, R., et al. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. 1998, Cambridge, U.K.: Cambridge University Press.
[13] Shifrin, J., B. Pardo, and W. Birmingham. HMM-Based Musical Query Retrieval. In Joint Conference on Digital Libraries. 2002. Portland, Oregon.
[14] Rabiner, L. and B.-H. Juang. Fundamentals of Speech Recognition. 1993, Englewood Cliffs, New Jersey: Prentice-Hall.


Tracks and Topics: Ideas for Structuring Music Retrieval Test Collections and Avoiding Balkanization

Jeremy Pickens
Center for Intelligent Information Retrieval
Department of Computer Science
University of Massachusetts, Amherst
[email protected]

ABSTRACT
This paper examines a number of ideas related to the construction of test collections for evaluation of music information retrieval algorithms. The ideas contained herein are not so much new as they are a synthesis of existing proposals. The goal is to create retrieval techniques which are as broadly applicable as possible, and the proposed manner for creating test collections supports this goal.¹

¹ This work was supported in part by the Center for Intelligent Information Retrieval and in part by NSF grant #IIS-9905842. Any opinions, findings and conclusions or recommendations expressed in this material are the author(s)' and do not necessarily reflect those of the sponsor.

1. INTRODUCTION
One of the fundamental problems encountered by music information retrieval system designers is that the representations for and sources of music are incredibly diverse. Music may be monophonic or polyphonic. It may be represented as digital audio, (digitized) analog audio (for example from old scratchy record or hissy tape collections), conventional music notation in symbolic/computer-readable format, conventional music notation as scanned images (sheet music), and event-level music such as MIDI, to name a few. One may have access to a full piece of music, or only to a snippet, such as a chorus or an incipit. Pieces of music may occur in the same key, or they might exist in numerous different keys. Pieces might be played or otherwise represented in a wide variety of tempos, or they might all be normalized to a single tempo. Depending on the source of the piece, there might be different types of errors in the final representation: users humming a piece will produce one type of error, automated transcriptions of audio could produce another type of error, and automated transcriptions of digitized sheet music could produce yet another type of error.

The combinatorial possibilities of these music sources are enormous. One such combination might be "incipits of polyphonic, MIDI music, normalized to C-Major but left in their original tempos." Another combination might be "monophonic full tunes, in audio format, unnormalized in any way, hummed by first year university music students." A rough combinatorial enumeration yields a total of 240 different varieties or types of music chunks. If we add to that the fact that the query could be of a different type than the music source collection (for example, the query might be audio and monophonic, while the collection has only sheet music, polyphonic pieces), then there are $\binom{240}{2} = 28{,}680$ possible experimental configurations.

(This work was supported in part by the Center for Intelligent Information Retrieval and in part by NSF grant #IIS-9905842. Any opinions, findings and conclusions or recommendations expressed in this material are the author's and do not necessarily reflect those of the sponsor.)
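For the record, the arithmetic behind that figure, assuming the 240 chunk types are simply paired without regard to order, is

\[
\binom{240}{2} = \frac{240 \times 239}{2} = 28{,}680 .
\]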

Clearly the music retrieval community does not have the resources to build even a couple dozen different test collections, much less twenty-eight thousand. Nevertheless, the varied types of systems being built by the community continue to proliferate. Systems are being built which work only for monophonic music, or only polyphonic music, or only audio music, or only key-transposed music, and so on. While these are necessary first stages in such a new research community, the goal should be to produce retrieval algorithms which are robust to the various music sources and representations. Otherwise, the community risks balkanization of the retrieval process and the creation of algorithms myopic in scope and unable to function outside of their narrow, specialized situations.

2. BACKGROUND

In this paper the focus is on ad hoc retrieval experiments. There are certainly many other important MIR-related tasks, such as automated audio transcription, automatic clustering and hierarchy creation for user browsing, and so on. For the sake of this discussion, we focus on the ad hoc task, defined as new queries on a static (or nearly static) collection of documents. The collection is known, a priori, but the query which will be asked is not. The Cranfield model is the standard evaluation paradigm for this sort of task and was outlined in the 1960s by Cleverdon et al. [1]. Along with many others in the MIR community, we support this model for music information retrieval evaluation.

TREC (Text REtrieval Conference) expanded upon these ideas, providing a centralized forum in which large, standardized collections could be assembled [4]. Rueger [6], Sodring [7], and Voorhees [8] have spent a good deal of effort explaining how TREC-style experiments may be applied to the MIR context. We wish to synthesize the lessons learned in these discussions with additional requirements suggested by other researchers.

In particular, Futrelle explained that "The goal of any basic research in MIR should be to develop techniques that can fit into a broad and comprehensive set of techniques. An evaluation of an MIR technique should situate itself in this larger context, and should acknowledge the implications the results have for the technique's role in a broader and more comprehensive set of techniques" [3]. We wish to integrate this goal with the advantages of TREC-style evaluation by emphasizing the notion of a track, as we will explain in Section 3.

Perhaps the most similar proposal to ours is the Melucci and Orio task-oriented approach [5]. In their paper, they propose identifying and separating queries into separate tasks by their "information requirements", or the broad category by which similar pieces of music will be found. For example, a certain query might be identified as a "melody" query, meaning that relevant documents will be melodically similar. Or another query might be identified as a "rhythm" query, meaning that relevant documents are going to be rhythmically similar. This is a very important distinction to make, as the same exact piece of music may be used as a query, but depending on a user's information requirement (need) a different algorithm will have to be built. At the same time, systems should not be so specific that you need a different type of system for every single subspecies of information requirement. For example, there should not have to be one information retrieval system for jazz melodies, another for classical melodies, and yet another for folk melodies. Knowing the broad information requirement, the fact that melodic similarity is desired, should be enough. Otherwise, balkanization increases as too many narrowly defined retrieval systems proliferate. There is a balance between homogeneity and variety that must be struck.

3. TRACKS AND TOPICS

We feel that striking this balance between homogeneity and variety is important, and we wish to carry it a step further. By so doing, we also believe we will meet the Futrelle goal of being able to develop techniques which fit into a broader context of music information retrieval research. The manner in which we propose to balance homogeneity and variety is to divide test collections into TREC-style tracks and topics.

A track is a broad statement about the type of task which will be done. A topic is an individual query, an expression of a user information need and other supporting information. A single track contains multiple topics, multiple variegated types of information need expressions. In this sense, a track is no different from the Melucci task in the previous section. Homogeneity is achieved in the sense that all the topics within a track have the same basic information need as their foundation. Variety is achieved in the sense that many topics within a track are slightly different expressions of that type of information need.

Returning to the example above, a "melodic" track is homogeneous because all the topics within that track have as their core need melodic similarity. A melodic track simultaneously has variety because there are not only folk melody queries, but jazz melody queries and classical melody queries as well. Furthermore, tracks allow us to meet the Futrelle requirement that algorithms developed for searching be as broadly applicable as possible, because in order to score well across all the topics in a track, a retrieval system developer cannot optimize only toward jazz queries, or only toward classical queries. In order to perform well on the task, more powerful, more broadly applicable algorithms will be developed. "The ideal MIR technique could be effectively applied to a wide variety of music, regardless of its cultural origin" [3].

Up to this point, our proposals are in alignment with most of the other white papers detailing TREC-style evaluation. Again, our goal is not to replace these ideas, but to expand on them. The main idea of this paper is simply to carry the notion of tracks and topics one step further, into the realm of representation and complexity. In Section 1 we spoke of the huge number of combinatorial possibilities that arose when systems were built and specifically tailored only toward monophonic music, polyphonic music, symbolic music, audio music, full pieces, incipits, choruses, and so on. Yet it should not matter if the music is in symbolic format, or scanned sheet music format, or audio, or if it is monophonic or polyphonic. In all cases, a user with the information need of finding pieces of music with the same tune as his query will have that need met no matter what the format of the retrieved piece.

Therefore, we wish to expand the "melody" track to include topics (and source collections) which contain not only jazz, classical, and folk pieces, but which also contain monophonic, polyphonic, audio, and symbolic pieces. The topics should also contain "full-text" pieces as well as incipit-only pieces, and chorus-only pieces. In short, homogeneity is preserved because all of the information needs expressed are thematically equivalent; users want pieces of music that contain the same "tune" as their query, no matter the form of the query or the source collection. At the same time, variety is also preserved because all the topics are slightly different in not only genre, but representation and complexity. It therefore becomes a worthy research goal to find algorithms which can deal with melodic similarity across all these boundaries.
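To make the track/topic structure concrete, the following minimal sketch (in Python) shows one way such records might be laid out; every field name here (track_id, information_need, representation, and so on) is a hypothetical illustration rather than part of any existing specification.

from dataclasses import dataclass, field
from typing import List

@dataclass
class Topic:
    """One expression of the track's information need (all fields hypothetical)."""
    topic_id: str
    genre: str           # e.g. "jazz", "classical", "folk"
    representation: str  # e.g. "audio", "MIDI", "scanned CMN"
    texture: str         # "monophonic" or "polyphonic"
    extent: str          # "full piece", "incipit", or "chorus"
    query_file: str      # path to the query material itself

@dataclass
class Track:
    """A broad task statement; all of its topics share one core information need."""
    track_id: str
    information_need: str
    topics: List[Topic] = field(default_factory=list)

# A "melody/tune" track whose topics vary in genre, representation, texture and extent.
melody_track = Track(
    track_id="melody",
    information_need="find pieces containing the same tune as the query",
    topics=[
        Topic("mel-001", "folk", "audio", "monophonic", "full piece", "queries/mel-001.wav"),
        Topic("mel-002", "jazz", "MIDI", "polyphonic", "incipit", "queries/mel-002.mid"),
        Topic("mel-003", "classical", "scanned CMN", "polyphonic", "chorus", "queries/mel-003.png"),
    ],
)

The point is only that the topics vary freely along every axis except the information need they express.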

4. TRACK SELECTION

Melodic similarity, increasingly misnamed because polyphony is an ingredient in the mixture, is only one possible track. Some users are not actually interested in melodic similarity, and thus the algorithms developed by systems using this track would not work. Tracks should work to maximize homogeneity, to a point; when the information need of a particular topic is too disparate from an existing track, a different track is needed.

I propose the following three major tracks for consideration in music information retrieval test collection construction:

1. "Melody/Tune" Track – Contains topics in which information needs (and thus relevance) are determined primarily from note pitch features. This does not mean that other features, such as duration and timbre to name just a few, cannot be used to aid the retrieval process. Indeed, durations of notes might better inform some sort of rhythmic structure, which could be used in determining melodic boundaries or significant changes. Timbral features in an audio piece might offer clues about which notes or chords are or are not part of a "tune". But the point is that, no matter what features are used, the similarity sought by this type of information need relates to the "tune" of the piece in question.

111

Page 34: MIR/MDL Evaluation: Making Progress · recommendations of the “MIR/MDL Evaluation Frameworks Project” is to be published as: Downie, J. Stephen. 2003. Toward the scientific evaluation

2. "Rhythm" Track – Contains topics in which information needs (and thus relevance) are determined primarily from note onset and duration features. Again, this does not mean that other features are unimportant. Suppose someone lays down a salsa beat, as a query, and the goal is to find other songs with a similar rhythm. Then being able to determine the timbre of the high-pitched clave, and using that timbre to determine when this instrument is struck, might give a good indication of where the main or important beats in a particular piece of music lie, thus better educating a rhythmic similarity matching algorithm. Or, if you saw in some symbolic piece of music that the note pitches returned to the tonic at some regular interval, that might help better identify measure boundaries or phrase/passage boundaries, which also could be useful for creating better rhythmic matching algorithms. Once again, no matter what features are used, the similarity algorithms associated with this track, with this type of information need, relate to rhythmic patterns.

3. "Genre" Track – Contains topics in which information needs (and thus relevance) are determined primarily from human-based genre judgements. This might be the hardest track to define, as genres include everything from heavy metal/country/rap in the popular audio domain, to mazurkas and cha-chas in the dance domain, to distinctions such as baroque, classical, and romantic in a period-based domain. Features used could include anything: pitch, harmony, duration, timbre, rhythm, and so on. A cha-cha might be similar to other cha-chas because of rhythmic clues, a baroque piece might be similar to another baroque piece because of certain harmonic progression clues (not the actual harmonic progressions, but the patterns inherent in those progressions), and a country song might be similar to another country song because of certain timbral clues (such as that characteristic "twang"). But in all cases, topics in this track have as their information need a similarity of genre.

As with the tune track, the rhythm and genre tracks also contain music of varied sources, representations, and complexity: monophonic, polyphonic, audio, symbolic, full-text, incipits only, and so on. As such, homogeneity is best preserved across tracks, while variety is expanded within a track.

These are not the only possible tracks, nor do I feel that these existing proposals are set in stone. The community might feel that the "tune" track is too broad, too homogenizing, and that, for the time being, there should be both an audio tune track and a symbolic tune track. Whatever the final decisions, however, we would like to reemphasize the notion of having only a small number of tracks, and a large number of topics within a track. If there are too many tracks, the community risks balkanization. If there are too few topics within a track, the statistical significance of retrieval evaluation and system comparison will be low. More tracks may be added in a few years, as community interest and size continues to grow. But in these beginning stages, a small number of tracks is preferable.

5. THE ROLE OF MUSICGRID

One more piece is needed to make the proposals in this paper possible. Dovey has recently proposed a Web Services-related framework for distributed MIR collaboration and evaluation: MusicGrid [2]. This architecture allows a community to share not only resources such as topics (queries) and source collections, but also algorithms which operate on this data. Not only can these algorithms be migrated to the data, rather than the other way around, but components may be pieced together like a puzzle, mixed and matched.

This has important consequences for the track and topic based evaluation we propose in this paper. In particular, one of the difficulties associated with collections of multifarious music, from monophonic to polyphonic, from audio to symbolic, from jazz to classical, is that not every research group in the community has the expertise, let alone the resources, to work with every type of music and representation.

Thus, if I am trying to work with some sort of pitch-based feature, and the data in the collection is piecewise audio, I will have to write my own transcription algorithm before I can even begin to examine those pieces. This can be prohibitively expensive, and leads many research groups to focus only on symbolic data. Yet with a GRID architecture, if one member of the research community has implemented a transcription algorithm, no matter how good or bad, that algorithm may be taken and plugged in to someone else's system as a front end.

As long as a "parser" exists for a particular music format, there is no need to develop music collections and/or queries in a standardized format. Research groups may bring collections of music, whether 50 pieces or 10,000 pieces, to the community, and as long as they also provide a parser which can read and "take apart" data in their format, the data will be accessible to all within the community. Thus more researchers can get up to speed quicker, designing algorithms which do better matching, rather than spending their time trying to parse various formats.
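As an illustration only (MusicGrid itself does not prescribe any particular interface), a contributed parser might be modelled as a small plug-in that maps a group's native format onto a shared event representation; all of the names below are hypothetical.

from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class NoteEvent:
    """A format-neutral event record shared by all parsers (illustrative only)."""
    onset: float      # in beats or seconds
    duration: float
    pitch: int        # MIDI note number
    voice: int = 0

class CollectionParser(ABC):
    """The contract a research group would ship alongside its collection."""

    @abstractmethod
    def list_pieces(self) -> List[str]:
        """Return identifiers of every piece in the contributed collection."""

    @abstractmethod
    def parse(self, piece_id: str) -> List[NoteEvent]:
        """'Take apart' one piece into the shared event representation."""

class ToyMonophonicParser(CollectionParser):
    """Trivial example backed by an in-memory dictionary of pitch lists."""

    def __init__(self, pieces: Dict[str, List[int]]):
        self._pieces = pieces

    def list_pieces(self) -> List[str]:
        return sorted(self._pieces)

    def parse(self, piece_id: str) -> List[NoteEvent]:
        return [NoteEvent(onset=i * 0.5, duration=0.5, pitch=p)
                for i, p in enumerate(self._pieces[piece_id])]

parser = ToyMonophonicParser({"twinkle": [60, 60, 67, 67, 69, 69, 67]})
print(parser.parse("twinkle")[:3])

A group contributing scanned sheet music or raw audio would ship a more elaborate parser (optical music recognition or transcription), but downstream matching code would see the same event stream either way.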

Therefore, with a MusicGrid architecture, a large number of topics may quickly be assembled, which topics may be tested against a large collection of music. For a given track, research groups need only submit a set of music pieces (the background collection), a set of topics (queries) which are intended to be run on this collection, and a parser which handles the data format of this collection. MusicGrid lets us simply take the union of all these topics and music to form a larger test collection for everyone to share.

Suppose I am building a retrieval algorithm which uses a pitch-based feature in some manner. Now, suppose two research groups provide access to their collections, and it turns out that the same piece of music is found in both collections. However, in the first collection, this piece of music exists in symbolic format, and in the second collection it is audio. The sequence of pitches gleaned from the symbolic parser on the symbolic piece will undoubtedly be slightly different than the sequence gleaned from the audio parser/transcriber on the audio piece. This is actually the whole point of amalgamating symbolic and audio pieces into the same track; the algorithms that will need to be developed to function on both perfect symbolic data as well as imperfect transcribed data should yield better insights into the nature of the problem than algorithms specifically tailored to a particular representation.
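To see why, here is a tiny illustration (not from the paper): comparing the pitch sequences produced by a symbolic parser and by a hypothetical audio transcriber for the same tune. A matcher that demands exact equality fails as soon as the transcriber drops or mangles a single note, whereas a distance-based matcher degrades gracefully.

def edit_distance(a, b):
    """Levenshtein distance between two pitch sequences (lists of MIDI numbers)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]

score_pitches = [60, 62, 64, 65, 67, 65, 64, 62]   # from the symbolic parser
transcribed   = [60, 62, 64, 67, 67, 65, 64]       # hypothetical transcription: one wrong note, one dropped
print(score_pitches == transcribed)                # False: exact matching fails outright
print(edit_distance(score_pitches, transcribed))   # 2: a distance-based matcher still sees a near match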


6. CONCLUSION

Evaluation drives research. Benchmarks help define research goals. Possession of a valid evaluation metric allows researchers to develop techniques which push the envelope of existing technologies and successfully meet the task at hand. By dividing music information retrieval evaluation into tracks and topics, we ensure that the techniques which will be developed in the future are sufficiently broad and powerful enough to handle a variety of different music sources, representations, and complexities, while at the same time are focused enough to meet a user's information need.

By employing the MusicGrid architecture in support of this evaluation paradigm, it will become much easier to bootstrap large, varied test collections together. Not only do larger collection and topic sets increase the community's confidence in the results of an evaluation metric, but the very manner in which the test collections are assembled helps prevent the balkanization of algorithms that might otherwise occur. Furthermore, this same architecture lets research groups, who otherwise would not have the resources, participate in the algorithm-crafting arena.

The tracks proposed in this paper are not set in stone. Further discussion is necessary to agree within the community which tasks are the most interesting, the most widely applicable. But whatever the outcome of such discussions, the very process of spanning together numerous topics with the same core information need, no matter what the representation format or music piece length or complexity, will help create robust and powerful music information retrieval systems.

7. REFERENCES

[1] C. W. Cleverdon, J. Mills, and M. Keen. Factors Determining the Performance of Indexing Systems, Volume I - Design, Volume II - Test Results. ASLIB Cranfield Project, Cranfield, 1966.

[2] M. J. Dovey. Music grid – a collaborative virtual organization for music information retrieval collaboration and evaluation. In J. S. Downie, editor, The MIR/MDL Evaluation Project White Paper Collection (Edition #2), pages 50–52, http://music-ir.org/evaluation/wp.html, 2002.

[3] J. Futrelle. Three criteria for the evaluation of music information retrieval techniques against collections of musical material. In J. S. Downie, editor, The MIR/MDL Evaluation Project White Paper Collection (Edition #2), pages 20–22, http://music-ir.org/evaluation/wp.html, 2002.

[4] D. Harman. The TREC conferences. In R. Kuhlen and M. Rittberger, editors, Hypertext - Information Retrieval - Multimedia; Synergieeffekte Elektronischer Informationssysteme, Proceedings of HIM '95, pages 9–28. Universitaetsforlag Konstanz, 1995.

[5] M. Melucci and N. Orio. A task-oriented approach for the development of a test collection for music information retrieval. In J. S. Downie, editor, The MIR/MDL Evaluation Project White Paper Collection (Edition #2), pages 29–31, http://music-ir.org/evaluation/wp.html, 2002.

[6] S. Rueger. A framework for the evaluation of content-based music information retrieval using the TREC paradigm. In J. S. Downie, editor, The MIR/MDL Evaluation Project White Paper Collection (Edition #2), pages 68–70, http://music-ir.org/evaluation/wp.html, 2002.

[7] T. Sodring and A. F. Smeaton. Evaluating a music information retrieval system, TREC style. In J. S. Downie, editor, The MIR/MDL Evaluation Project White Paper Collection (Edition #2), pages 71–78, http://music-ir.org/evaluation/wp.html, 2002.

[8] E. M. Voorhees. Whither music IR evaluation infrastructure: Lessons to be learned from TREC. In J. S. Downie, editor, The MIR/MDL Evaluation Project White Paper Collection (Edition #2), pages 7–13, http://music-ir.org/evaluation/wp.html, 2002.


MIR Benchmarking: Lessons Learned from the Multimedia Community

Josh Reiss
Department of Electronic Engineering
Queen Mary, University of London
Mile End Road, London E1 4NS, UK
+44-207-882-5528
[email protected]

Mark Sandler
Department of Electronic Engineering
Queen Mary, University of London
Mile End Road, London E1 4NS, UK
+44-207-882-7680
[email protected]

ABSTRACT

Music Information Retrieval may be perceived as part of the larger Multimedia Information Retrieval research area. However, many researchers in Music Information Retrieval are unaware that the problems they deal with have analogous problems in image and video retrieval. Many issues concerning the creation of testbed digital libraries and effective benchmarking of information retrieval systems are common to all multimedia retrieval systems. We examine the approaches used in the image and video communities and show how they are applicable to testbed creation and information retrieval system evaluation when the media is music.

1. INTRODUCTION

In recent years, the Music Information Retrieval (MIR) community has been concerned with the creation of Music Digital Libraries and with the benchmarking and evaluation of MIR systems. For the most part, they have been working alone on these issues. Notable exceptions to this come from the Library and Information Science community, such as the liaising with TREC researchers on ideas for benchmarking, and the inclusion of MDLs into larger digital libraries, e.g., incorporation of MelDex into the New Zealand Digital Library [1]. Still, the MIR community has encountered few researchers who deal with related challenges of a depth and complexity similar to their own.

This isolation is unnecessary. Both the Video Information Retrieval (VIR) and Image Information Retrieval (IIR) communities have struggled with the difficulties in creating large digital libraries, designed appropriately for IR evaluation and free from copyright issues. VIR is now a track at TREC, and so they have a large amount of experience in evaluating their retrieval systems. The IIR community have created their own methods for benchmarking and evaluation, borrowing some ideas from TREC and creating novel approaches for content-based retrieval of multimedia.

In addition, many within the Multimedia Information Retrieval (MMIR) community have identified problems that may exist in MIR evaluation of which MIR researchers are not yet aware. In IIR, there was a need to automate and streamline the evaluation process. Furthermore, they found that the lack of a common access method, analogous to SQL for relational databases, was a hindrance to providing consistency in evaluation. Thus they devised their own solutions to these problems.

VIR researchers have encountered the more abstract problems associated with the vagueness of relevance definitions for multimedia. Similarity between documents when they each have a time dependent component is a complicated issue. The choice of metadata, segmentation and the subjective evaluation of precision and recall are all issues of concern to both MIR and VIR researchers. Furthermore, video researchers have explored different approaches from those typically used with musical data.

In this paper, we study the approaches of the MMIR community and see where these approaches are applicable to MDL creation and MIR evaluation and benchmarking. From this, we propose a set of recommendations and guidelines for the MIR community. These guidelines have the benefit of requiring only small modifications from the guidelines that have been tested and streamlined for video and image problems. Finally, we suggest how the research in MIR may be augmented and used within a full Multimedia Information Retrieval System that incorporates images, video and audio.

This document is divided into three sections. The first contains a discussion of the experiences and suggestions concerning creation of digital libraries for multimedia. The second section concerns methods and implementations of benchmarking and evaluation of MMIR systems. Finally, in the third section, we discuss how to incorporate musical queries into a full multimedia system.

2. COMMENTS ON MDL CREATION

2.1 The Open Video Project

One of the more interesting projects in video that could be mirrored by the MIR/MDL community is the Open Video Project [2]. Anticipating a future with widespread access to large digital libraries of video, a great deal of research has focused on methods of browsing and retrieving digital video, developing algorithms for creating surrogates for video content, and creating interfaces that display result sets from multimedia queries. Research in these areas has required that each investigator acquire and digitize video for their studies, since the multimedia information retrieval community does not yet have a standard collection of video to be used for research purposes. The primary goal of the Open Video Project is to create and maintain a shared digital video repository and test collection to meet these research needs.

The Open Video Project aims to collect and make available video content for the information retrieval, digital library, and digital video research communities. Researchers can use the video to study a wide range of problems, such as testing algorithms for feature extraction and the creation of metadata, or creating and evaluating interfaces that display result sets from multimedia queries. The idea is to collect video that is in the public domain, or provided by owners who grant permission to use their intellectual property for research purposes, and make that video available in a variety of standard formats, including streaming, along with a set of accompanying metadata. Because researchers attempting to solve similar problems will have access to the same video content, the repository is also intended to be used as a test collection that will enable systems to be compared, similar to the way the TREC conferences are used for text retrieval.

This repository is hosted as one of the first channels of the Internet 2 Distributed Storage Infrastructure Initiative [3], a project that supports distributed repository hosting for research and education in the Internet 2 community.

2.1.1 Project Overview

The Open Video Project began in 1998 with the development of a basic framework and the digitization of the initial content. Additional video was contributed by various other projects. The first stage also included entering metadata for each segment into a database, and creating a Web site to enable researchers to access the available video.

The next stage of the project involves adding additional video segments to the repository, and expanding both the available formats and genre characteristics (news, entertainment, and home videos) of the video. As the size of the repository is expanded in this stage of the project, the database schema is extended to incorporate more metadata fields. Further work concerns creating innovative interfaces to the video repository that enable users to more easily search, browse, preview, and evaluate the video in the collection.

2.1.2 Copyright Issues

The Open Video repository provides video clips from a variety of sources, especially various video programs obtained from U.S. government agencies such as the U.S. Records and Archives Administration and NASA. Although the government agency videos were produced with public funds and are freely available from the Archives, no copyright clearance has been obtained for audio or video elements in these productions. They encourage researchers to use the data under fair use for research purposes. Those wishing to use these video clips in any commercial enterprise must bear the burden of obtaining copyright clearances.

This inevitably will create problems since the fair use clause may differ from country to country, and sometimes includes additional restrictions for international use outside the country of origin. Furthermore, it is unclear how liable the creators of the library may be if the copyright is violated by an MDL user or anyone else. Recording companies may want a stronger guarantee than the pledge that the material will just be used for research purposes. Thus, an Open Audio Project, analogous to the Open Video Project, is not sufficient for MDL creation.

Two proposals from within the MIR community may be viable alternatives. The first [4] is similar to the Open Video Project, but would include no copyrighted or restricted material. Instead, material would be contributed by the MIR community and amateur musicians. The idea was suggested for use specifically with MIDI, but could easily be expanded to other formats.

The second proposal [5] offers the ability to use copyrighted material while still ensuring that it is not misused. Here, the testbed is kept secure, and it may be accessed such that the analysis, processing and data mining are all performed within the secure area, and no restricted material is released.

2.2 Lessons learned from the IIR community concerning multimedia database collection

The image information retrieval community suffers from many of the same problems concerning testbeds and data collection that the MIR community does. The most commonly used images in IIR come from the Corel Photo CDs, a collection of copyrighted, commercially available images. Most research groups can only purchase a small subset of the collection. Since each CD contains a set of similar images, most testbeds contain several dissimilar image groups. This leads to overoptimistic benchmarking of IIR systems.

An alternative collection is the images from MPEG-7, which has the benefit of being used in an official standard. Unfortunately, the MPEG-7 collection is also expensive and may not be used on the web.


[Figure 1. Flowchart depicting an automated benchmark for MMIR systems (adapted from [6]). Components shown: MRML as the communication layer, the multimedia retrieval systems under test, the benchmark server, system knowledge, performance measures, the multimedia database, judgment groups, relevant media, and relevance judgments.]

Thus, many in the IIR community have resorted to developing their own collections, much the same as is done in the MIR community. The Annotated Groundtruth Database [7] is one such system. This is a freely available, uncopyrighted collection of annotated photographs from different regions and about different topics. This is still a small collection (approximately 1,000 images), and it has been suggested that this collection be used as the starting point for a larger testbed. However, as it stands there is no commonly accepted uncopyrighted testbed.

The MIR community finds itself in the same position with respect to fragmentation and a lack of standards with digital libraries. In regards to copyright issues, it is in an even more restricted position since the music that researchers are most interested in is almost all copyrighted and the rights are heavily guarded.

2.3 The World Wide Web as a testbed

The internet allows individual users to store audio files and file indexing on computers distributed throughout the world. As such, it represents a large-scale, distributed audio testbed, ideal for a web-based music information retrieval system. The issues involved in the creation of web-based multimedia information retrieval systems have been explored thoroughly in the context of mixed media consisting of images, text and video, but have yet to be applied to audio.

In [8], an integrated visual retrieval system, supporting global visual query access to multimedia databases on the web, was described. This system required a central server and media stored in databases located at many sites accessible through the web. As such, it assumes a centrally administered repository with a Napster-like model. Such a model is useful in unifying independent testbeds run by different research centers.

Other web-based IIR systems, such as WebMARS [9], WebSeek [10] (which is also a VIR system) and ImageRover [11], are multimedia search engines that incorporate content-based image information retrieval as well as text-based retrieval. They all have the benefit of accessing image files anywhere on the web, i.e., the largest possible testbed. However, all these systems require indexing of images and for the features to be stored in order for the queries to be answered quickly. Images may be removed but still appear in the index. They allow searching of various file sizes and formats, and searching for different features (different types of queries). Yet they are also quite different from each other, and use different sets of features and incorporate metadata in differing ways.

3. COMMENTS ON BENCHMARKING AND EVALUATION

3.1 Competitive benchmarking

The IIR community has established the Benchathlon (www.benchathlon.net), which uses a competitive approach to IR evaluation. The systems under evaluation are contestants, and follow strict guidelines in order to enter the competition. The benefits of such an approach are that it can be achieved when the ratio of the number of judges to the number of entrants is low, it spurs entrants to improve their systems by inspiring the competitive spirit, and it allows analysis of the various systems through direct comparison.

However, such competition, even when suggested only partly in jest as with the Benchathlon, opposes the spirit and ethos of the methods used by TREC. One of the primary tenets employed by TREC is that their evaluation system is not meant as a contest.

TREC, although not using test subjects in real world environments, still uses extensive subjective testing and evaluation. To duplicate such a deep test would prove extremely difficult for small research groups with low financial resources. Thus automatic benchmarking has been proposed.

3.2 Automated benchmarking

One of the difficulties in MMIR benchmarking has been the lack of a common access method to retrieval systems. For instance, SQL, or Structured Query Language, is a relational database query language that has been adopted as an industry standard. No such equivalent is available for content-based multimedia retrieval systems. Thus, the image information retrieval community has proposed the Multimedia Retrieval Markup Language (MRML) [12]. Although proposed in the context of IIR, it could be applied to other media and thus would assist in standardising access to MIR systems.


[Figure 2. (a) An extensible image benchmark framework (adapted from [13]): a benchmark suite spanning natural images (photos, medical images, ...) and synthetic images (cartoons, trademarks, ...), with tasks such as object-based and shape-based retrieval. (b) An analogous musical benchmark framework: a benchmark suite spanning audio representations (monophonic, polyphonic, ...) and symbolic representations (MIDI, common music notation, ...), with tasks such as feature-based retrieval, melodic contours, and transcription-based retrieval.]

The biggest problem in automatically benchmarking IIR systems is the lack of a common access method. The advent of MRML has solved this problem. MRML standardizes IIR access. It allows a client to log onto a database and ask for the available image collections as well as to select a certain similarity measure, and to perform queries using positive and negative examples. With such a communication protocol the automated evaluation of IIR systems is possible.

As depicted in Figure 1, MRML serves as the communication layer between the evaluated systems and the benchmark server. The multimedia digital library and the performance measures are known to all the systems. Relevance judgments may be known for initial testing, but for proper and fair evaluation, should not be known by the MMIR systems.

Of importance to the MIR community is the fact that MRML was designed to be both highly extensible and gracefully degradable. This means that it could also be extended to MIR systems. It allows for support of other media, and commands intended for images could be easily ignored by an MIR system.

Automated benchmarking can then be achieved by using MRML as the common access method to enable different MIR systems to answer the same queries, using the same testbed, and receive immediate relevance judgments.
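As a schematic sketch only, the loop looks something like the following; the in-process system calls below stand in for whatever common access method is adopted (MRML or otherwise), and none of the names are taken from the MRML specification.

from typing import Callable, Dict, List, Set

# A "system under test" maps a topic (query) id to a ranked list of document ids;
# in a real benchmark this call would travel over the common protocol rather than
# being an in-process function.
System = Callable[[str], List[str]]

def precision_at_k(ranked: List[str], relevant: Set[str], k: int = 10) -> float:
    """Fraction of the top-k retrieved documents that are relevant."""
    top = ranked[:k]
    return sum(1 for doc in top if doc in relevant) / max(len(top), 1)

def run_benchmark(systems: Dict[str, System],
                  topics: List[str],
                  qrels: Dict[str, Set[str]]) -> Dict[str, float]:
    """Send every topic to every system and average precision@10 per system.
    The relevance judgments (qrels) stay on the benchmark server side."""
    scores = {}
    for name, system in systems.items():
        per_topic = [precision_at_k(system(t), qrels.get(t, set())) for t in topics]
        scores[name] = sum(per_topic) / len(per_topic)
    return scores

# Toy usage with two fake systems and two topics.
qrels = {"q1": {"d1", "d3"}, "q2": {"d2"}}
systems = {
    "baseline": lambda q: ["d1", "d2", "d3"],
    "shuffled": lambda q: ["d9", "d8", "d1"],
}
print(run_benchmark(systems, ["q1", "q2"], qrels))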

3.3 Creation of an Evaluation Framework

As has been pointed out previously, MIR benchmarking is a somewhat vague problem because different MIR systems are built to answer different queries and often deal with very different representations of digital music [14,15]. The video community, due to the complexity of its data, has been concerned with more low-level tasks, such as segmentation and feature extraction [16]. IIR researchers, however, have carefully classified the information retrieval tasks they deal with and the related data types that are used [13]. This classification, breakdown and analysis of information retrieval tasks can easily cross over into MIR.

An example of this is given in Figure 2. It depicts how a benchmark suite would consist of parts that should not be fully complete and unchangeable. Indeed, with few exceptions, similarity measures for symbolic representations of music are very different from those for raw audio representations. This classification scheme, together with the work on testbeds, database access, and automated and competitive benchmarking, has led to a full evaluation framework for content-based IIR [17].

Recent work has presented a wide variety of ideas and early-stage research in MIR benchmarking and evaluation [5,18]. However, this work is yet to be formalized to the extent that it has been in the IIR community. Thus, MIR researchers should be aware that others in multimedia have devised a structure for evaluation that can deal with the variety and complexity of multimedia queries.

4. SUPPORTING MUSICAL QUERIES IN AN MMIR SYSTEM

In this section, we seek to address how best to incorporate MIR searches into a multimedia information retrieval system. We consider the complex interplay that can arise from searching across media, and the special types of issues that this creates. Lastly, we show how the creation and indexing of metadocuments can lead to an effective large-scale MIR system with the ability to retrieve multimedia documents.


Table 1. A list of hypothetical queries which use different media for the queried document, retrieved documents and/or indexing system. In each of the presented queries, musical content is in some way processed or retrieved.

Find video performances of music that sounds like… (Input: Audio; Output: Video; Intermediate: Audio)

Find the transcription of… (Input: Audio; Output: Image; Intermediate: Symbolic)

Find all documents in the database related to the Beatles. (Input: Text; Output: Multimedia; Intermediate: Metadocument)

Here is an album cover. Find music from that album. (Input: Image; Output: Audio; Intermediate: Metadocument)

What does this sheet music sound like? (Input: Image; Output: Audio; Intermediate: Symbolic)

Show me highlights of this sports event. (Input: Video/Text; Output: Video; Intermediate: Audio)

Find the album version of the song in this video. (Input: Video; Output: Audio; Intermediate: Audio)

What is the soundtrack of this movie? (Input: Video; Output: Audio; Intermediate: Audio)

4.1 Cross-Media Queries

Music Information Retrieval, although having unique problems and involving an interesting and unusual mix of interdisciplinary challenges, may be classified within the general subject of multimedia information retrieval (MMIR). In a rigorous study of multimedia queries on the WWW [19], analysis of over a million queries to text-based search engines suggested that, as lower estimates, approximately 3.39% of them were image or video related, whereas only 0.37% were audio related. Furthermore, the audio queries typically used more terms. This suggests that there may be a sizeable number of image, video, or text related queries where the preferred retrieved documents were music related, but not necessarily audio. This is given further credence by the fact that "lyrics" was a popular audio-related search term and "videos" was a popular video-related search term.

Even though it is difficult to determine the demand for cross-media based information retrieval systems, it is relatively easy to construct pertinent multimedia queries. Table 1 lists a variety of queries which use or retrieve music-related content. In each situation, multiple media types are used. The categorization is meant to be indicative as opposed to formal. The simplest examples involve music videos, which are often standard MIR-style audio queries with the exception that the corpus has a video as well as audio component. Other examples, such as "What does this sheet music sound like?" represent active research areas in the MIR community [20,21].

One of the most interesting examples, "Show me highlights of this sports event," involves neither musical content in the query statement nor in the retrieved documents. However, musical content can be an important feature in the audio stream. If critical moments in a sporting event are accompanied by music (such as a popular song played on the loudspeakers in baseball stadia after each run is scored), then identifying this in audio content will often prove much easier than identifying visual clues in the video stream. The use of audio identifiers to retrieve relevant video-based information is well known to researchers in Video Information Retrieval [22-24], but the specific use of musical content in the audio-visual stream represents a novel area of research.

4.2 Use of metadocuments in an MMIR Indexing Scheme

An inherent problem with retrieving multiple media types in a feature-based index is that the appropriate features are different for each medium. Image classification schemes often use the discrete cosine transform or the Gabor transform, whereas text feature extraction uses lexical analysis, and audio feature extraction often uses windowed harmonic content. Thus similarity indexing based on features would not enable retrieval of images related to audio, or vice versa. Such a problem necessitates the use of metadata as a means of hyperlinking related multimedia.

Through the combined use of metadata, metadocuments and hyperlinks, an effective cross-media indexing scheme may be devised. This has also been considered in IIR, where the text metadata often associated with images on the web may be used for retrieval of web pages with relevant visual content [9].

For instance, an image may be entered as the query, features extracted and the closest match found. The related metadocuments for this image are then searched to find related media. Thus, a scanned image of an album cover can be used to find songs off the album, lyrics, similar images, a video interview with the album producer, and so on. This has the additional benefit that it creates both an appropriate indexing and an appropriate browsing scheme for multimedia.

Furthermore, the system need be no more complex than the sum of its parts. All media retrieval can be performed using the same multidimensional feature set indexing and retrieval scheme (see [25] for details of an appropriate multidimensional search method). Then metadocuments can be searched using any appropriate keyword and text based scheme.
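A minimal sketch of the idea, with entirely hypothetical record names: each media item keeps its own per-medium feature vector, and a metadocument hyperlinks related items so that a feature match in one medium can be followed to documents in another.

from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class MediaItem:
    item_id: str
    media_type: str        # "audio", "image", "video", or "text"
    features: List[float]  # whatever per-medium feature vector is in use

@dataclass
class Metadocument:
    """Ties related items of different media together under shared metadata."""
    title: str
    metadata: Dict[str, str]
    items: List[MediaItem] = field(default_factory=list)

def nearest(query: MediaItem, index: List[MediaItem]) -> MediaItem:
    """Feature match within one medium (squared Euclidean distance for brevity)."""
    same_medium = [m for m in index if m.media_type == query.media_type]
    return min(same_medium,
               key=lambda m: sum((a - b) ** 2 for a, b in zip(m.features, query.features)))

def cross_media(query: MediaItem, index: List[MediaItem],
                metadocs: List[Metadocument], want: str) -> List[MediaItem]:
    """Match in the query's own medium, then follow its metadocument to another medium."""
    hit = nearest(query, index)
    linked = [d for d in metadocs if any(i.item_id == hit.item_id for i in d.items)]
    return [i for d in linked for i in d.items if i.media_type == want]

# Toy usage: an album-cover image query retrieves the album's audio.
cover = MediaItem("img-1", "image", [0.1, 0.9])
song = MediaItem("aud-1", "audio", [0.4, 0.4])
album = Metadocument("Some Album", {"artist": "X"}, [cover, song])
print(cross_media(MediaItem("q", "image", [0.12, 0.88]), [cover, song], [album], "audio"))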


5. CONCLUSION

In this work, we considered the approaches of the Multimedia Information Retrieval community to the problems of multimedia digital library creation and information retrieval benchmarking and evaluation. For all multimedia resources, there are problems concerning copyrights. Some researchers in the video community have taken the somewhat risky approach of assuming that all use will comply with regulations regarding the Fair Use of Copyrighted Materials. In IIR, several commercially available data sets are frequently used, thus there remain the problems of standardization and expense.

It therefore seems that the current proposals for MDL testbeds, secure copyrighted databases or contributed uncopyrighted testbeds, are preferable. A third approach, which is not specific to just images or video, is the use of the web as a multimedia corpus. This concept has obvious benefits because of its use of a sufficiently large and varied collection. However, this would allow for only certain types of MIR systems, i.e., web-based, and has additional issues concerning the ever-changing nature of the testbed, speed of access, and lack of metadata.

In regards to benchmarking and evaluation, both video and imaging researchers have made great strides. The video community has been very active in TREC, and thus can serve as a guide to how TREC can assist in MMIR evaluation, and also of how MIR benchmarking could mirror the TREC approach if the music retrieval community chose not to participate in TREC.

In image retrieval, they have considered an approach to benchmarking that is distinctly different from the TREC approach. First, they have referred to the evaluation as a contest, whereas TREC emphasizes its noncompetitive nature. Furthermore, the IIR researchers do not have the resources to manage large data collections, create and refine topic statements, pool individual results, judge retrieved documents, and evaluate results. Thus, they have automated the process through the use of MRML, a standardized access scheme for IIR systems. It is clear that the approach followed by IIR researchers deserves consideration by the MIR community. Certainly, those interested in having a common access method should consider the extension of MRML to musical queries.

Finally, we considered how MMIR systems may be linked to allow cross-media queries. Video, images and music retrieval systems all often use feature extraction in the summarization of content. Only metadata is required to link them. Given that research into retrieval is advancing throughout the multimedia community, and that many topic statements might require different media for the query, the retrieved documents or for intermediate stages, it is clear that the convergence of Multimedia Information Retrieval systems will be an active area of research in the future.

References

1 R.J. McNab, L.A. Smith, D. Bainbridge et al., The New Zealand Digital Library MELody inDEX, D-Lib Magazine May (1997).

2 G. Marchionini and G. Geisler, The Open Video Digital Library, D-Lib Magazine 8 (12) (2002).

3 M. Beck and T. Moore. The Internet2 Distributed Storage Infrastructure Project: An Architecture for Internet Content Channels. 3rd International WWW Caching Workshop, Manchester, England, June 15-17 1998.

4 J. A. Montalvo. A MIDI Track for Music IR Evaluation. JCDL Workshop on the Creation of Standardized Test Collections, Tasks, and Metrics for Music Information Retrieval (MIR) and Music Digital Library (MDL) Evaluation, Portland, Oregon, July 18 2002.

5 J. S. Downie. Toward the Scientific Evaluation of Music Information Retrieval Systems. Fourth International Conference on Music Information Retrieval (ISMIR 2003), Washington, D.C., USA, October 26-30 2003.

6 H. Muller, M. Muller, S. Marchand-Maillet et al. Automated Benchmarking In Content-Based Image Retrieval. 2001 IEEE International Conference on Multimedia and Expo (ICME2001), Tokyo, Japan, August 2001.

7 "Annotated groundtruth database," Department of Computer Science and Engineering, University of Washington, 1999, www.cs.washington.edu/research/imagedatabase.

8 D. Murthy and A. Zhang. WebView: A Multimedia Database Resource Integration and Search System over Web. WebNet 97: World Conference of the WWW, Internet and Intranet, Toronto, Canada, October 1997.

9 M. Ortega-Binderberger, S. Mehrotra, K. Chakrabarti et al. WebMARS: A Multimedia Search Engine for Full Document Retrieval and Cross Media Browsing. Multimedia Information Systems Workshop, Chicago, IL, October 2000.

10 S.-F. Chang, Visually Searching the Web for Content, IEEE Multimedia Magazine 4 (3), 12 (1997).

11 S. Sclaroff, L. Taycher, and M. La Cascia. ImageRover: A Content-Based Image Browser for the World Wide Web. Workshop on Content-Based Access of Image and Video Libraries (CBAIVL '97), Puerto Rico, June 1997.

12 H. Muller, M. Muller, S. Marchand-Maillet et al. MRML: A communication protocol for content-based image retrieval. International Conference on Visual Information Systems (Visual2000), Lyon, France, November 2–4 2000.

13 C. H. C. Leung and H. H. S. Ip. Benchmarking for Content-Based Visual Information Search. Advances in Visual Information Systems (Visual 2000), Lyon, France, November 2-4 2000.

14 J. D. Reiss and M. B. Sandler. Beyond Recall and Precision: A Full Framework for MIR System Evaluation. 3rd Annual International Symposium on Music Information Retrieval, Paris, France, October 17 2002.

15 J. D. Reiss and M. B. Sandler. Benchmarking Music Information Retrieval Systems. JCDL Workshop on the Creation of Standardized Test Collections, Tasks, and Metrics for Music Information Retrieval (MIR) and Music Digital Library (MDL) Evaluation, Portland, Oregon, July 18 2002.

16 A. F. Smeaton. An Overview of the TREC Video Track. 10th TREC Conference, Gaithersburg, Md, 13-16 November 2001.

17 H. Müller, W. Müller, S. Marchand-Maillet et al., A framework for benchmarking in visual information retrieval, International Journal on Multimedia Tools and Applications 21 (2), 55 (2003).

18 J. S. Downie. Panel on Music Information Retrieval Evaluation Frameworks. 3rd International Conference on Music Information Retrieval (ISMIR), Paris, France, 2002.

19 B. J. Jansen, A. Goodrum, and A. Spink, Searching for multimedia: video, audio, and image Web queries, World Wide Web Journal 3 (4), 249 (2000).

20 G. S. Choudhury, T. DiLauro, M. Droettboom et al. Optical music recognition system within a large-scale digitization project. First International Conference on Music Information Retrieval (ISMIR 2000), Plymouth, Massachusetts, October 23-25 2000.

21 J. R. McPherson. Introducing feedback into an optical music recognition system. Proceedings of the Third International Conference on Music Information Retrieval (ISMIR 2002), 2002.

22 S.-F. Chang. Searching and Filtering of Audio-Visual Information: Technologies, Standards, and Applications. IEEE International Conference of Information Technology, Las Vegas, NV, March 2000.

23 H. Sundaram and S.-F. Chang. Video Scene Segmentation using Audio and Video Features. ICME 2000, New York, New York, July 28-Aug 2 2000.

24 N. V. Patel and I. K. Sethi. Audio characterization for video indexing. Proceedings of the SPIE on Storage and Retrieval for Image and Video Databases, San Jose, CA, February 1-2 1996.

25 J. D. Reiss, J.-J. Aucouturier, and M. B. Sandler. Efficient Multidimensional Searching Routines for Music Information Retrieval. 2nd Annual International Symposium on Music Information Retrieval, Bloomington, Indiana, USA, October 15-17 2001.


Appendix A: Candidate Music IR Test Collections

Donald Byrd
School of Music, Indiana University
+1-(812)-856-0129
[email protected]

Last revised 28 March 2003

By themselves, these collections might well be useful for a number of research projects. But with suitable queries and relevance judgments (if at all possible, human-produced), they would be much more useful, especially for cross-project comparisons via EMIR (tentative name of the proposed TREC-like "Evaluation of Music IR"). NB: appearance in this list means only that I think the collection exists in machine-readable form somewhere and might be available; there are serious copyright as well as other availability issues for most of these! Exception: collections whose name is prefixed with "* " don't even exist yet in electronic form, as far as I know. The RWC database is the one most clearly available (or soon to be available) for research purposes, but it's very, very small. The Uitdenbogerd & Zobel collection has the only existing set of human relevance judgments I know of, but the judgments are not at all extensive. This list is a continuing work-in-progress: I have very little time for it, but attempt to maintain it as a public service. In particular, I'm aware that the listings for CMN in image format (scanned scores and/or sheet music) are not very up-to-date. Comments are very welcome!

Entries are in alphabetical order by name. Each entry below gives, separated by vertical bars: Name | Representation | Encoding | Complexity | Approx size | Comments.

Bach Chorales | Event | MIDI (SMF) | Polyphonic | 400 | Short 4-part contrapuntal pieces: 185 (from BG v.39) + c.200 (from elsewhere)

Barlow and Morgenstern | CMN | ?? | Monophonic | 10,000 | Themes of classical pieces plus "index"

CCARH MuseData | CMN | MuseData | Polyphonic | 4000 | Complete movements of 881 Classical pieces from c. 1680 to 1815, at least 1/3 Bach. Includes 185 Bach chorales. See www.musedata.org

CCARH MuseData | CMN | kern | Polyphonic | 3000 | Complete movements of c. 700 Classical pieces from c. 1680 to 1815, about half Bach. Includes 185 Bach chorales. (Based on the above.)

CD Sheet Music | “CMN” | Image | Polyphonic | 7000? | Scanned scores. 48 CDs, each of c. 750-3800 pages of PDFs; total c. 86,400 pages

Classical Archives | Event | MIDI (SMF) | Polyphonic | 17,000 | Contributed MIDI files of public-domain works; quality uneven. See www.classicalarchives.com

Classical Archives | Audio | Lossy (MP3, WMA) | Polyphonic | 5200 | Recordings of public-domain works.

*ECOLM | CMN | LTN | Polyphonic | 1000 | Electronic Corpus of Lute Music: complete lute pieces

Harvard/MIT | “CMN” | Image | Polyphonic?? | ?? | Scanned scores


Huron | CMN | Humdrum kern | ?? | 5000 |

HymnQuest | CMN | ?? | ?? | 13,000 | Published by Stainer & Bell on CDROM (with monophonic content searching)

IRCAM | Audio | Most lossy (MP2), some lossless (CD) | Polyphonic | 5000 | contemporary music works, over 50% non-commercially available; all with metadata

JHU/Levy Sheet Music | “CMN”, CMN | Image, some Guido | Polyphonic | 29,000 | scanned sheet music (c.100,000 pages, c. 80% public-domain)

L of C/Copyright Sheet Music | “CMN” | Image | Polyphonic | 22,000 | scanned sheet music ("American Sheet Music, 1870-1885": pieces copyrighted in those years)

L of C/Duke Sheet Music | “CMN” | Image | Polyphonic | 3000 | scanned sheet music ("Historic American Sheet Music, 1850-1920": pieces from the Duke collection)

Meldex (NZDML) Folksongs | CMN | Meldex | Monophonic | 10,000 | 9400 German, Chinese, and Anglo-American folksongs from Schaffrath's collection and Digital Tradition. (Better would be *"MELDEX Plus": add back in the c. 200 containing tuplets NZDML removed.)

Mutopia | CMN | LilyPond(?) | Polyphonic | 300 | Contributed classical works: www.mutopiaproject.org/index.html

Nightingale | CMN | Nightingale | Polyphonic | 600 | movements and excerpts, mostly classical (same music as below)

Nightingale | Event | MEF | Polyphonic | 600 | movements and excerpts, mostly classical (same music as above)

NZDML Fake book | CMN | Meldex(?) | Monophonic?? | 1200 | popular tunes

NZDML MidiMax | Event | MIDI (SMF) | Polyphonic?? | 100,000 | Standard MIDI files (collected from the Web?)

Parsons | Pitch Contour | Pitch Contour | Monophonic | 11,000 | up/down/repeat encoding of pitches of themes of classical pieces

Pickens | Event | MIDI (SMF) | Polyphonic | 1200 | Piano-only MIDI files, collected by hand from the web, in a variety of styles: Ragtime, classical, popular, original

RISM | CMN | Plaine & Easie | ?? | 250,000 | Incipits from 188,000 works by end of 1995

RWC | Event | MIDI (SMF) | Polyphonic | 200 | Pieces in four genres: 100 pop, 15 "royalty-free" (sic), 50 classical, 50 jazz; most original, all freely available for research (same as below)

RWC | Audio | Lossless (CD) | Polyphonic | 200 | Pieces in four genres: 100 pop, 15 "royalty-free" (sic), 50 classical, 50 jazz; most original, all freely available for research (same as above)

Sunhawk | CMN | Proprietary | Polyphonic | 25,000 | Many genres. See www.sunhawk.com


Templeton | "CMN" | Image | Polyphonic | 22,000 | Scanned sheet music; at Mississippi State University
Themefinder | CMN | MuseData, kern? | Monophonic | 37,000 | Incipits of classical pieces and folksongs. See www.themefinder.org
UCLA Pop. American | "CMN" | Image | Polyphonic?? | 600 | Pieces of popular American music
Uitdenbogerd & Zobel | Event | MIDI (SMF) | Polyphonic | 10,500 | MIDI files (collected from the Web). Some "real" relevance judgments exist.
Variations, Variations2 | "CMN" | Image | Polyphonic | ?? | Scanned scores
Variations, Variations2 | Audio | Lossy (MP2, MP3) | Polyphonic | ?? | c. 9,000 hours (original Variations)

Other possibilities:

*?? | Audio | WAV, WMA, MP3 | ?? | ?? | Preferably encoded in a lossless format
*Web crawling | Event | MIDI (SMF) | ?? | ?? | ??
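To make the point about queries and relevance judgments concrete, here is a minimal sketch (in Python, with invented toy data; a generic illustration, not the actual EMIR scoring procedure) of how a single set of human judgments lets two different systems' ranked results for the same query be compared on the same footing, using uninterpolated average precision, one of the measures popularized by TREC:

def average_precision(ranked_ids, relevant_ids):
    """Uninterpolated average precision of one ranked result list
    against one set of human relevance judgments."""
    hits, precision_sum = 0, 0.0
    for rank, item in enumerate(ranked_ids, start=1):
        if item in relevant_ids:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant_ids) if relevant_ids else 0.0

# Toy example: two systems return different rankings for the same query;
# the shared judgments make the two scores directly comparable.
judged_relevant = {"piece_A", "piece_C"}
system_1 = ["piece_A", "piece_B", "piece_C", "piece_D"]   # AP = (1/1 + 2/3) / 2 = 0.83
system_2 = ["piece_B", "piece_D", "piece_A", "piece_C"]   # AP = (1/3 + 2/4) / 2 = 0.42
print(average_precision(system_1, judged_relevant),
      average_precision(system_2, judged_relevant))

With a shared collection, shared queries, and shared judgments, numbers like these become comparable across projects, which is precisely what the collections listed above cannot yet support on their own.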


Appendix B: ISMIR 2001 Resolution on MIR Evaluation

(taken from http://music-ir.org/mirbib2/resolution)

Support for the ISMIR 2001 resolution on the need to create standardized MIR test collections, tasks, and evaluation metrics for MIR research and development

Background: On 16 October 2001, at ISMIR 2001, Bloomington, IN, Drs. Don Byrd and J. Stephen Downie convened an informal session to discuss issues pertaining to the creation of standardized MIR test collections, retrieval tasks, and evaluation metrics. After vigorous debate, a general consensus was reached that the MIR research community must move forward on the creation of such collections, tasks, and metrics. At the behest of Dr. David Huron, Matthew Dovey drew up the following resolution for submission to the final plenary session of ISMIR 2001, 17 October 2001. Those present overwhelmingly endorsed the resolution.

It was further resolved that we encourage all those who have an interest in these matters to express their support for the resolution by allowing them to "sign" the resolution. This resolution and its appended signatures will be used to solicit from relevant funding bodies and rights holders the materials necessary to make the establishment of standardized MIR test collections, tasks, and evaluation metrics a reality.

If you have any questions or comments, please contact J. Stephen Downie: [email protected]

The Resolution: There is a current need for metrics to evaluate the performance and accuracy of the various approaches and algorithms being developed within the Music Information Retrieval research community. A key component for the development of such metrics would be a corpus of test data consisting of both audio and structured music data. Such a corpus would need to be readily available to the research community, with international clearance of all relevant copyright and other intellectual property rights necessary to use this data solely for the purpose of music information retrieval research. At the International Symposium on Music Information Retrieval, 2001, hosted by Indiana University, a resolution was passed endorsing support for activities which would lead to the establishment of such a corpus and hence evaluation metrics for this community.

97 Signatories as of 26 July 2003:

Mical Beyene, Graduate Student, Manchester University (Thu Jul 17 09:40:33 2003)
Ewald Hesse, Webmaster, Georg Thieme Verlag, Stuttgart, Germany (Sat Feb 1 13:51:25 2003)
Mida, Director, Italia Centre for Music Documentation (Tue Oct 22 10:27:23 2002)
Alain Van Kerckhoven, Director, Belgian Centre for Music Documentation (Mon Oct 14 08:19:28 2002)
Dr. Gaël RICHARD, E.N.S.T. (Télécom Paris), 46, rue Barrault, 75013 Paris, France (Wed Sep 25 08:46:56 2002)
Emilia Gómez, PhD Candidate, Audiovisual Institute, Pompeu Fabra University, Barcelona (Thu Sep 5 12:57:00 2002)
Christian Spevak, Researcher, University of Karlsruhe, Germany (Thu Aug 1 11:44:45 2002)
Pasang Thackchhoe, Librarian, Multicultural History Society of Ontario, Canada (Wed Jul 24 08:40:16 2002)
James Kalbach, Information Architect, Razorfish Germany (Hamburg) (Tue Jul 23 10:27:34 2002)
Kasper Souren, M.Sc. Mathematics, Ircam - Centre Pompidou, Paris, France (Mon Jul 15 06:19:27 2002)
Margie Wiers, Music Librarian, Kinnison Music Library, Ohio Wesleyan University (Fri May 17 13:52:44 2002)
Diemo Schwarz, PhD candidate, Ircam - Centre Pompidou, Paris, France (Tue Mar 26 10:13:05 2002)
Pedro Cano Vila, Ph.D. Candidate, Audiovisual Institute, Pompeu Fabra University (Mon Mar 4 04:41:11 2002)


Yan Huang, Engineer, Panasonic Speech Technology Lab (Sat Feb 9 01:39:44 2002)
Hannah Frost, Media Preservation Librarian, Stanford University (Tue Jan 29 12:34:32 2002)
Michael Nelson, Associate Professor, Faculty of Information and Media Studies, University of Western Ontario (Thu Jan 3 12:31:37 2002)
Yazhong Feng, Ph.D student, Department of Computer Science, Zhejiang University, China (Wed Dec 19 19:40:29 2001)
Mr. Yohannes Abraha, Research and Documentation Center, Eritrea (Mon Dec 17 11:03:05 2001)
Jeremy Pickens, Research Assistant, Center for Intelligent Information Retrieval, Department of Computer Science, University of Massachusetts Amherst (Mon Dec 10 12:15:49 2001)
Anna Pienimäki, Assistant Researcher, Dept. of Computer Science, University of Helsinki, Finland (Tue Dec 4 02:53:16 2001)
Jyh-Shing Roger Jang, Associate Professor, Computer Science Department, National Tsing Hua University, Taiwan (Sat Dec 1 08:51:54 2001)
David Meredith, Research Fellow, City University, London (Wed Nov 28 10:10:53 2001)
Jiyoung Shim, graduate student, Yonsei University (Tue Nov 27 07:36:34 2001)
Pierre-Yves ROLLAND, Assistant Professor of Computer Sc. and Multimedia Tech., Université d'Aix-Marseille 3, France (Mon Nov 19 08:54:16 2001)
Ichiro Fujinaga, Professor, Peabody Conservatory of Music, Johns Hopkins University (Wed Nov 14 19:59:46 2001)
Sayeed Choudhury, Hodson Director of the Digital Knowledge Center, Johns Hopkins University (Wed Nov 14 13:32:37 2001)
Mr Mikael Fernstrom, Interaction Design Centre, University of Limerick, Ireland (Tue Nov 13 10:22:45 2001)
Donncha S. Ó Maidín, Head of Department of Computer Science, University of Limerick, Limerick, Ireland (Tue Nov 13 09:21:19 2001)
Dirk Van Steelant, Master Computer Science, KERMIT - Ghent University, Belgium (Tue Nov 13 02:13:08 2001)
Kjell Lemström, Researcher, Dept. of Computer Science, University of Helsinki, Finland (Tue Nov 13 02:00:33 2001)
John McPherson, Ph.D Candidate, University of Waikato, Hamilton, New Zealand (Mon Nov 12 17:12:46 2001)
Jane M. Subramanian, Music Cataloger/Archivist, SUNY Potsdam, Potsdam, NY (Mon Nov 12 13:37:24 2001)
Fabien Gouyon, PhD candidate, Audiovisual Institute, Pompeu Fabra University, Barcelona (Mon Nov 12 08:28:41 2001)
George Tzanetakis, PhD candidate, Computer Science Department, Princeton University (Sat Nov 10 12:48:43 2001)
Jane Singer, PhD candidate, Hebrew University of Jerusalem (Israel) (Fri Nov 9 06:52:18 2001)
Micheline Lesaffre, Musicologist, PhD Candidate, IPEM, Dept. of Musicology, Ghent University, Belgium (Fri Nov 9 04:42:51 2001)
Taelman Johannes, R&D Engineer, Institute for Psychoacoustics and Electronic Music, Ghent University, Belgium (Fri Nov 9 04:37:44 2001)
Dirk Moelants, Assistant, IPEM-Dept. of Musicology, Ghent University, Belgium (Fri Nov 9 04:36:28 2001)
Mark Sandler, Professor of Signal Processing, Queen Mary University of London (Wed Nov 7 03:09:27 2001)
Min HUANG, Associate Professor, University Library, Shanghai Jiao Tong University (Tue Nov 6 19:16:05 2001)
Craig Burket, Senior Software Development Engineer, Stellent, Inc., 62 Forest St., Suite 100, Marlborough, MA 01752, USA (Tue Nov 6 12:46:09 2001)
Cheryl Martin, Director of Bibliographic Services, McMaster University, Hamilton, Ontario, Canada (Mon Nov 5 08:54:37 2001)
Lenore Coral, Music Librarian, Sidney Cox Library of Music and Dance, Cornell University (Mon Nov 5 08:07:45 2001)
Dick Vestdijk, librarian on private title, The Netherlands (Mon Nov 5 07:40:47 2001)
Steven Blackburn, Independent Researcher, formerly with the University of Southampton, UK (Sat Nov 3 11:41:37 2001)
John Vallier, graduate student (info studies/ethnomusicology), UCLA (Sat Nov 3 09:40:08 2001)
David Dorman, Consultant, Lincoln Trail Libraries System, Champaign, IL (Fri Nov 2 12:11:41 2001)
Ronald Rousseau, Professor, KHBO, Dept. Industrial Sciences and Technology, B-8400 Oostende, Belgium (Fri Nov 2 01:02:39 2001)


Joyce McGrath, foundation member of IAMLANZ in 1968, now retired; formerly Art & Music Librarian at the State Library of Victoria, Melbourne, Australia (Thu Nov 1 17:34:25 2001)
Pamela Thompson, Chief Librarian, Royal College of Music, Prince Consort Road, London SW7 2BS, UK (Thu Nov 1 16:50:54 2001)
James M Turner, Professeur agrégé, École de bibliothéconomie et des sciences de l'information, Université de Montréal (Thu Nov 1 15:23:31 2001)
Pamela Juengling, Music Librarian, University of Massachusetts/Amherst (Thu Nov 1 14:10:23 2001)
Leslie Troutman, Associate Professor of Library Administration, Music Library, University of Illinois at Urbana-Champaign (Thu Nov 1 12:35:04 2001)
Adriane Swalm Durey, PhD Candidate, School of Electrical and Computer Engineering, Georgia Institute of Technology (Wed Oct 31 11:01:05 2001)
Donald Byrd, Senior Scholar, Indiana University (Wed Oct 31 08:07:15 2001)
Aymeric Zils, PhD candidate, SONY Computer Science Labs, Paris, France (Wed Oct 31 03:31:23 2001)
Jean-Julien Aucouturier, Assistant Researcher, SONY Computer Science Labs, Paris, France (Wed Oct 31 02:50:55 2001)
Chaokun Wang, Ph.D. candidate, Dept. of Computer Science and Technology, HIT (Harbin Institute of Technology), P.R.C. (Tue Oct 30 19:31:33 2001)
Nigel Nettheim, Honorary Adjunct Fellow, Macarthur Auditory Research Centre, Sydney (Tue Oct 30 15:21:14 2001)
Timothy DiLauro, Deputy Director, Digital Knowledge Center, Johns Hopkins University, Milton S. Eisenhower Library (Tue Oct 30 14:52:05 2001)
Allen Renear, Associate Professor, Graduate School of Library and Information Science, University of Illinois at Urbana-Champaign (Tue Oct 30 14:25:31 2001)
David Bainbridge, Lecturer, Dept. of Computer Science, University of Waikato, New Zealand (Tue Oct 30 14:20:20 2001)
Matija Marolt, Ph.D. candidate, Faculty of Computer and Information Science, University of Ljubljana, Slovenia (Tue Oct 30 14:11:58 2001)
Tim Crawford, Project Manager, OMRAS, Department of Music, King's College, Strand, London WC2R 2LS, United Kingdom (Tue Oct 30 09:56:34 2001)
Beth Logan, Principal Research Scientist, Cambridge Research Laboratory, Compaq Computer Corporation (Tue Oct 30 08:01:54 2001)
David Nichols, Visiting Lecturer in Digital Libraries, Graduate School of Library and Information Science, University of Illinois at Urbana-Champaign (Mon Oct 29 14:42:27 2001)
Mary Wallace Davidson, Head, William & Gayle Cook Music Library, Indiana University, Bloomington, USA (Mon Oct 29 11:03:48 2001)
Elena Ferrari, Professor, Università degli Studi dell'Insubria, Como (Italy) (Mon Oct 29 09:31:48 2001)
Dr Stefan Rueger, Research Lecturer, Department of Computing, Imperial College, London, UK (Mon Oct 29 09:00:56 2001)
Margaret Cahill, Junior Lecturer, University of Limerick, Ireland (Mon Oct 29 07:51:50 2001)
Seymour Shlien, Research Scientist, Communications Research Centre (Mon Oct 29 06:45:28 2001)
Juan Pablo Bello, Ph.D. candidate, Queen Mary College, University of London (Mon Oct 29 05:57:19 2001)
Bozena Kostek, Assoc. Prof. (Ph.D., D.Sc., Eng.), Sound & Vision Eng. Dept., Technical University of Gdansk, Poland (Mon Oct 29 04:25:30 2001)
Dr. Xavier Serra, Director, Audiovisual Institute, Pompeu Fabra University (Mon Oct 29 03:55:00 2001)
Inigo Barrera, Ph.D. Candidate, UPC (Polytechnical University of Catalonia), Barcelona, Spain (Mon Oct 29 02:37:49 2001)
Shyamala Doraisamy, MPhil/PhD candidate, Imperial College, London, UK (Mon Oct 29 02:30:27 2001)
Thomas Sødring, Ph.D. Candidate, Dublin City University (Mon Oct 29 02:16:25 2001)
Giuseppe Frazzini, Information Technologist, Teatro alla Scala, Milano (Mon Oct 29 01:25:47 2001)
Gaetan Martens, Master Computer Science, IPEM + Applied Mathematics and Computer Science, Ghent University, Belgium (Sun Oct 28 03:10:11 2001)
Koen Tanghe, R&D Engineer, IPEM (Institute for Psychoacoustics and Electronic Music), Ghent University, Belgium (Sat Oct 27 15:35:55 2001)
Michael Good, CEO, Recordare LLC (Fri Oct 26 23:41:14 2001)
Cheng Yang, Ph.D. candidate, Stanford University (Fri Oct 26 18:19:50 2001)


Jonathan Foote, Sr. Research Scientist, FX Palo Alto Laboratory, Inc. (Fri Oct 26 12:49:18 2001)
Les Gasser, Associate Professor, Graduate School of Library and Information Science, University of Illinois at Urbana-Champaign (Fri Oct 26 12:18:47 2001)
P. Bryan Heidorn, Assistant Professor, University of Illinois, Graduate School of Library and Information Science (Fri Oct 26 12:14:13 2001)
Sally Jo Cunningham, Senior Lecturer, University of Waikato (Fri Oct 26 11:23:48 2001)
Eric Isaacson, Associate Professor of Music Theory, Indiana University, Bloomington, USA (Fri Oct 26 11:14:34 2001)
Jay Kim, Ph.D student, Rutgers University (Fri Oct 26 11:13:25 2001)
Matthew J. Dovey, R&D Manager, Libraries Systems, Oxford University (Fri Oct 26 10:55:41 2001)
Antony Gordon, System administrator/cataloguer, British Library National Sound Archive (Fri Oct 26 10:54:02 2001)
Emanuele Pollastri, Engineer - PhD Student, Università degli Studi di Milano, Dipartimento di Scienze dell'Informazione (Fri Oct 26 10:53:40 2001)
Allan Smeaton, Professor of Computing, Dublin City University, Ireland (Fri Oct 26 10:52:37 2001)
Maurizio Longari, MS - PhD Student, Università degli Studi di Milano, Dipartimento di Scienze dell'Informazione (Fri Oct 26 11:08:18 2001)
Jeffrey M. Vincour, Undergraduate, Cornell University, Ithaca, N.Y., USA (Fri Oct 26 11:45:54 2001)
Perry Roland, Information Technologist, University of Virginia (Fri Oct 26 13:08:28 2001)
J. Stephen Downie, Assistant Professor, Graduate School of Library and Information Science, University of Illinois at Urbana-Champaign (Fri Oct 26 13:02:54 2001)


Appendix C: The TREC-Like Evaluation of Music IR Systems
(Poster to be published in the Proceedings of SIGIR 2003.)

J. Stephen Downie

Graduate School of Library and Information Science, University of Illinois at Urbana-Champaign

[email protected]

ABSTRACT
This poster reports upon the ongoing efforts being made to establish TREC-like and other comprehensive evaluation paradigms within the Music IR (MIR) and Music Digital Library (MDL) research communities. The proposed research tasks are based upon expert opinion garnered from members of the Information Retrieval (IR), MDL, and MIR communities with regard to the construction and implementation of scientifically valid evaluation frameworks.

Categories and Subject Descriptors
H.3.4 [Information Storage and Retrieval]: Systems and Software – performance evaluation

General Terms
Measurement, Performance, Human Factors

Keywords
TREC, Evaluation, Music Information Retrieval

1. INTRODUCTION
MIR is a multidisciplinary research endeavor that strives to develop innovative content-based searching schemes, novel interfaces, and evolving networked delivery mechanisms in an effort to make the world's vast store of music accessible to all. Good overviews of MIR's interdisciplinary research areas can be found in [1-3].

1.1 Current Scientific Problem
Notwithstanding the promising technological advancements being made by the various research teams, MIR research has been plagued by one overarching difficulty: there has been no way for research teams to scientifically compare and contrast their various approaches. This is because there has existed:
1. no standard collection of music against which each team could test its techniques;
2. no standardized sets of performance tasks; and,
3. no standardized evaluation metrics.
The MIR community has long recognized the need for a more rigorous and comprehensive evaluation paradigm. A formal resolution expressing this need was passed on 16 October 2001 by the attendees of the Second International Symposium on Music Information Retrieval (ISMIR 2001). (See http://music-ir.org/mirbib2/resolution for the list of signatories.)

Over a decade ago, the National Institute of Standards and Technology developed a testing and evaluation paradigm for the text retrieval community, called TREC (Text REtrieval Conference; http://trec.nist.org/overview.html). Under this paradigm, each text retrieval team is given access to:
1. a standardized, large-scale test collection of text;
2. a standardized set of test queries; and,
3. a standardized evaluation of the results each team generates.
It is upon this TREC paradigm that we plan to create our "TREC-like" formal evaluation scenario. "TREC-like" is used deliberately, as music retrieval presents some important differences from text retrieval and our evaluation tests must take these into account (see the Major Research Questions below).
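For readers unfamiliar with the text-TREC machinery referred to here, items 2 and 3 in practice reduce to two small plain-text files per experiment: a "qrels" file of pooled relevance judgments and a "run" file of each team's ranked results, scored against one another by the trec_eval program. The layouts below follow the customary trec_eval field order; the identifiers are invented placeholders, and a music-oriented evaluation would need analogues of both:

    qrels line:  051 0 DOC-1234 1               (topic ID, iteration, document ID, relevance judgment)
    run line:    051 Q0 DOC-1234 1 0.87 teamA   (topic ID, literal "Q0", document ID, rank, score, run tag)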

The two principal research streams being outlined are based upon a synthesis and analysis of expert opinion garnered from members of the IR, MDL, and MIR communities with regard to the construction and implementation of scientifically valid evaluation frameworks. As part of the Mellon-funded "MIR/MDL Evaluation Frameworks Project" (http://music-ir.org/evaluation), the outcomes of two fact-finding meetings form the foundation upon which this project is grounded. The presentations made at each of the meetings have been collected in successive editions of The MIR/MDL Evaluation White Paper Collection. See http://music-ir.org/evaluation for the most recent edition.

1.2 Major Research Questions
The project is informed by the following key research questions:
1. How do we adequately capture the complex nature of music queries so that proposed experiments and protocols are well-grounded in reality?
2. How do we develop new models and theories of "relevance" in the MIR context (i.e., what does "relevance" really mean in the MIR context)?
3. How do we evaluate the utility, within the MIR context, of already-established evaluation metrics (e.g., precision and recall; their standard definitions are recalled below)?
4. How do we integrate the evaluation of MIR systems with the larger framework of IR evaluation (i.e., what aspects are held in common and what are unique to MIR)?
5. How do we continue the expansion of a comprehensive collection of music materials to be used in evaluation experiments?
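For reference in connection with question 3, the definitions being re-examined are the usual set-based textbook ones (stated here only as a reminder; nothing below is specific to this project):

\[
\mathrm{Precision} = \frac{|R \cap A|}{|A|}, \qquad
\mathrm{Recall} = \frac{|R \cap A|}{|R|},
\]

where R is the set of items judged relevant to a query and A is the set of items the system actually retrieves. Both quantities hinge on what counts as "relevant", which is exactly what question 2 identifies as unsettled in the MIR context.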

2. STREAM #1: A TREC-LIKE TESTBED
The author and colleagues have begun to construct the world's first and only internationally accessible, large-scale MIR testing and development database, to be housed at the University of Illinois' National Center for Supercomputing Applications (NCSA) (Fig. 1). Formal transfer and use agreements are being finalized with HNH Hong Kong International, Ltd. (http://www.naxos.com), the owner of the Naxos and Marco Polo recording labels. This generous gesture on the part of HNH represents approximately 30,000 audio tracks, or about 3 terabytes of digital audio music information. All Media Guide (http://www.allmusic.com) has also agreed to follow HNH's lead, enabling UIUC/NCSA to incorporate its vast database of music metadata within the same test collection.



2.1 System Overview
It is important that the MIR testing and evaluation database be constructed with three central features in mind:
1. security for the property of the rights-holders, especially important if we are to convince other rights-holders to participate in the future;
2. accessibility for both internal, domestic, and international researchers; and,
3. sufficient computing and storage infrastructure to support the computationally- and data-intensive techniques being investigated by the various research teams.
To these ends, we are exploiting the expertise and resources of NCSA and its Automated Learning Group (ALG), headed by Prof. Michael Welge. Using the ALG's D2K technology as a starting point, we are creating a secure "Virtual Research Lab" (VRL) for each participating research team. These VRLs will provide secure access to the test collection and the resources necessary to conduct large-scale MIR evaluation experiments. Simply put, we enhance the security of the valuable music data by bringing the research teams to the collection, rather than distributing the collection willy-nilly around the globe.
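To illustrate the "bring the teams to the collection" idea, the sketch below simulates, in miniature, what a lab-side harness might do: a team's retrieval code runs next to the protected collection, and only aggregate evaluation scores ever travel back out. Every name here is hypothetical; this is not the D2K or VRL interface, just a runnable Python illustration of the design principle.

import hashlib

# Illustrative sketch only (hypothetical names throughout); not an actual
# D2K/VRL interface. The point is the shape of the boundary: code goes in,
# scores come out, audio never leaves.

def team_retrieval_system(query_id, collection_ids):
    """Stand-in for a team-supplied MIR system: rank the collection for one query."""
    # A real system would analyse audio; here a deterministic dummy ordering
    # keeps the sketch runnable end to end.
    return sorted(collection_ids,
                  key=lambda cid: hashlib.md5((query_id + cid).encode()).hexdigest())

def run_inside_vrl(system, queries, collection_ids, qrels, cutoff=10):
    """What a lab-side harness might do: run the team's system over the
    protected collection, score it, and return only aggregate numbers."""
    per_query = {}
    for qid in queries:
        ranked = system(qid, collection_ids)[:cutoff]
        relevant = qrels.get(qid, set())
        per_query[qid] = sum(1 for cid in ranked if cid in relevant) / cutoff
    return {"precision_at_%d" % cutoff: sum(per_query.values()) / len(per_query)}

if __name__ == "__main__":
    collection = ["track%03d" % i for i in range(100)]   # stands in for the ~30,000 protected tracks
    qrels = {"q1": {"track007", "track042"}, "q2": {"track013"}}
    print(run_inside_vrl(team_retrieval_system, ["q1", "q2"], collection, qrels))

The security property comes from the interface itself: code and scores cross the boundary, the audio does not.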

3. STREAM #2: HUMIRS (pronounced "hummers")
We must ensure that the test tasks developed are realistic proxies for the kinds of uses that MIR/MDL systems might expect to encounter. Synthesizing from the suggestions made by the expert participants, it appears that a minimal TREC-like query record needs to include the following basic elements:
1. High-quality audio representation(s)
2. Verbose metadata:
   i. about the "user"
   ii. about the "need"
   iii. about the "use"
3. Symbolic representation(s) of the music presented

One is struck by how these requirements are less like a traditional TREC topic statement and more like the kind of information garnered in a traditional, well-conducted reference interview [4, 5]. This suggests that the involvement of professional music librarians in the development of the TREC-like music query records is very important, perhaps even critical.
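A hedged sketch of what such a minimal query record might look like as a data structure is given below. The field names are invented for illustration (they are not drawn from any project specification), but they map one-to-one onto the elements listed above: audio representation(s), verbose metadata about the user, the need, and the use, and symbolic representation(s) of the music presented.

from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical field names; intended only to mirror the minimal TREC-like
# query-record elements listed above, not a project specification.
@dataclass
class MusicQueryRecord:
    query_id: str
    audio_files: List[str]                                         # high-quality audio representation(s)
    symbolic_files: List[str] = field(default_factory=list)        # e.g. MIDI or CMN encodings of the presented music
    user_metadata: Dict[str, str] = field(default_factory=dict)    # about the "user" (who is asking)
    need_metadata: Dict[str, str] = field(default_factory=dict)    # about the "need" (what is wanted and why)
    use_metadata: Dict[str, str] = field(default_factory=dict)     # about the "use" (what will be done with the result)

# Example record, with the kind of detail a reference interview might elicit:
example = MusicQueryRecord(
    query_id="EMIR-2003-001",
    audio_files=["queries/hummed_theme_001.wav"],
    symbolic_files=["queries/theme_001.mid"],
    user_metadata={"background": "amateur choral singer"},
    need_metadata={"goal": "identify a half-remembered 4-part chorale"},
    use_metadata={"intended_use": "locate a printed score for rehearsal"},
)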

As one can see, the project needs to collect and analyze data concerning real-world users, the ways in which they express their needs, and how they intend to use the results of their searches. To this end, we are constructing a multifaceted research programme aimed at capturing these important facts through a variety of "needs and uses" studies and human-computer interaction studies. We have tentatively titled this research stream "Human Use of Music Information Retrieval Systems" (HUMIRS). We are currently sketching out the roles that the UIUC Music Library (needs and uses) and the NCSA Usability Lab (human-computer interaction) can play in this regard.

4. REFERENCES
[1] Downie, J. S. Music Information Retrieval. Annual Review of Information Science and Technology 37: 295-340, 2003.
[2] Byrd, D., and Crawford, T. C. Problems of Music Information Retrieval in the Real World. Information Processing and Management 38: 249-272, 2002.
[3] Futrelle, J., and Downie, J. S. Interdisciplinary Communities and Research Issues in Music Information Retrieval. Third International Conference on Music Information Retrieval: 215-221, 2002.
[4] Dewdney, P., and Michell, G. Asking "Why" Questions in the Reference Interview: A Theoretical Justification. Library Quarterly 67: 50-71, 1997.
[5] The Reference Interview. In Reference and Information Services: An Introduction, eds. Bopp, R. E., and Smith, L. C. Englewood, CO: Libraries Unlimited: 47-68, 2001.

Figure 1. Schematic of the secure, yet accessible, test collection environment.
