

A Web-Based Game for Collecting Music Metadata

Michael I. Mandel and Daniel P.W. Ellis

Columbia University, USA

Abstract

We have designed a web-based game, MajorMiner, that makes collecting descriptions of musical excerpts fun, easy, useful, and objective. Participants describe 10 second clips of songs and score points when their descriptions match those of other participants. The rules were designed to encourage players to be thorough and the clip length was chosen to make judgments objective and specific. To analyse the data, we measured the degree to which binary classifiers could be trained to spot popular tags. We also compared the performance of clip classifiers trained with MajorMiner's tag data to those trained with social tag data from a popular website. On the top 25 tags from each source, MajorMiner's tags were classified correctly 67.2% of the time, while the social tags were classified correctly 62.6% of the time.

1. Introduction

The easiest way for people to find music is by describing it with words. Whether this is from a friend's recommendation, browsing a large catalogue to a particular region of interest, or searching for a specific song, verbal descriptions, although imperfect, generally suffice. While there are notable community efforts to verbally describe large corpora of music, e.g. Last.fm, these efforts cannot sufficiently cover new, obscure, or unknown music. Efforts to pay expert listeners to describe music, e.g. Pandora.com, suffer from similar problems and are slow and expensive to scale. It would be useful in these cases to have an automatic music description system, the simplest of which would base its descriptions wholly on the audio and would require human generated descriptions on which to train computer models. An example use of such a system can be seen in Figure 1 along with the data that can be used to train it.

Building a computer system that provides sound-based music descriptors requires human descriptions of the sound itself for training data. Such descriptions are most likely to be applied to clips, short segments of relatively obscure songs that provide little context to the listener. Mining available writings on music is not sufficient, as only a small portion of the vast quantity of these, e.g. record reviews and blogs, describes aspects of the sound itself; most describe the music's social context. Broad discussions of genre or style generally focus on the social aspects of music, while specific descriptions of a short segment generally focus on the sound. Similarly, due to the heterogeneous nature of musical style, it is not certain that a description of a genre or style applies to all possible segments of music within that style. Shorter clips are more likely to be homogeneous, which makes the connection between language and music more definite.

Thus in this project, we endeavour to collect ground truth about specific, objective aspects of musical sounds by asking humans to describe clips in the context of a web-based game1. Such a game entertains people while simultaneously collecting useful data. Not only is the data collection process interesting, but the game itself is novel. The overarching goal of collecting thorough descriptions of the music shaped many design decisions about the game-play, including the rules for scoring, the way in which clips are chosen for each player, and the ways in which players can observe each other.

We have also run a number of experiments to test the effectiveness with which an automatic music classifier can be trained on the data collected by such a game.

Correspondence: Michael I. Mandel, Columbia University, LabROSA, Department of Electrical Engineering, USA. E-mail: [email protected]

1 The game is available to play at http://www.majorminer.org

Journal of New Music Research, 2008, Vol. 37, No. 2, pp. 151-165. DOI: 10.1080/09298210802479300 © 2008 Taylor & Francis


These experiments compare binary classification accuracy for many individual tags on sets of clips that balance positive and negative examples. We first measured the effect of varying the amount of MajorMiner data available to a classifier, showing that a six-fold increase in training data can improve classification for some tags by ten percentage points. For the other experiments, we used two variations of tags from MajorMiner and three variations of tags from the social music website Last.fm2. We compared these datasets on their seven common tags and found that MajorMiner's data were more accurately classified on all but two tags. We also compared classification accuracy for the 25 tags most frequently applied to our music collection by each dataset. Classifiers trained on MajorMiner's data achieve an average accuracy of 67.2% and those trained on Last.fm's data achieve 62.6%.

An example of one possible application of this work is shown in Figure 1. To make this figure, Outkast's song "Liberation" was broken down into contiguous 10 second clips. While a number of these clips were labelled in the MajorMiner game, most of them were not. By training autotaggers on other songs that were tagged in the game, we are able to automatically fill in the descriptions that should be applied to the unseen clips. The agreement between the autotags and the tags collected in the game for this song can be seen in the figure as dark regions of high predicted relevance correspond to outlined regions of human judgements. In addition, the structure of the song becomes visible as the autotags change in response to changes in the music.

1.1 Example of game play

An example session of the game will provide a sense of its rules and strategy. A screenshot of this example session can be seen in Figure 2. First the player, mim, requests a new clip to be tagged. This clip could be one that other players have seen before or one that is brand new; he does not know which he will receive. He listens to the clip and describes it with a few words: slow3, harp, female, sad, love, fiddle, and violin. The words harp and love have already been used by one other player, so each of them scores mim one point. In addition, the players who first used each of those words have, at this time, two points added to their scores (regardless of whether they are currently playing the game).

Fig. 1. Automatic tagging of all ten second segments within a track, illustrating one goal of this work. The top pane shows the mel-scale spectrogram of OutKast's "Liberation", with major instrumentation changes manually labelled ("m vx" = male voice, "f vx" = female voice, "m vxs" = multiple male voices, "pi" = piano, "bs" = bass guitar, "dr" = drums, "+" indicates instruments added to the existing ensemble). The lower pane shows the output of automatic classifiers trained on the 14 shown labels. Note, for example, how the switch to a single female lead vocal from 4:04 to 4:52 is strongly detected by the labels "female", "jazz", "r&b", "soul", and "urban". The six columns of outlined cells indicate the six clips for which human tags were collected in the MajorMiner game; thin outlines indicate that the tag was used at least once, and thick outlines show the most common tags for each clip. Notice that on the whole human tags are consistent with higher scores from the classifiers.

2 http://www.last.fm/ and http://www.audioscrobbler.com

3 We will use bold font to denote tags.


Since the words female and violin have already been used by at least two players, they score mim zero points. The words sad and fiddle have not been used by anyone before, so they score no points immediately, but have the potential to score two points for mim at some later time should another player use one.

When the player has finished tagging his clips he can go to his game summary, an example of which can be seen in Figure 3. The summary shows both clips that he has recently seen and those that he has recently scored on, e.g. if another player has agreed with one of his tags. It also reveals the artist, album, and track names of each clip and allows mim to see one other player's tags for each clip. In the figure, the other player has already scored two points for describing the above clip with bass, guitar, female, folk, violin, love, and harp, but has not scored any points yet for acoustic, country, drums, or storms. When he is done, mim logs out. The next time he logs in, the system informs him that three of his descriptions have been used by other players in the interim, scoring him six points while he was gone.

1.2 Previous work

A number of authors have explored the link between music and text. Whitman and Ellis (2004) trained a system for associating music with noun phrases and adjectives from a collection of reviews from the All Music Guide and Pitchfork Media. This work was based on the earlier work described by Whitman and Rifkin (2002). More recently, Turnbull et al. (2006) used a naive Bayes classifier to both annotate and retrieve music based on an association between the music and text. This work is inspired by similar work in the field of image retrieval, such as Barnard et al. (2003) and Carneiro and Vasconcelos (2005). Eck et al. (2008) used boosted classifiers to identify the top k tags describing a particular track, training the classifiers on tags that the users of Last.fm had entered for the track's artist.

There have also been a number of games designed to collect metadata about multimedia. Ellis et al. (2002) described the "Musicseer" system for collecting ground truth about artist similarity, one aspect of which was a game. In this work, players chose which of a list of artists was most similar to a goal artist. The purpose of the game was to link a starting artist to a goal artist with as short a chain of intermediate similar artists as possible. By performing this forced choice of the most similar artist from a list, triplets of relative similarity were collected, which could then be used to infer a full similarity matrix. von Ahn and Dabbish (2004) described the "ESP Game"4, which asks pairs of players to describe the same image. Once both players agree on a word, they score a certain number of points and move on to the next image. The players attempt to agree on as many images as possible within a time limit. While previous data-collection games maintained data integrity by forcing players to choose from predefined responses, this was the first game to popularize the idea of allowing any response, provided it was verified by a second player.

1.3 Recent music games

Inspired by the success of the ESP game, a number of music-related human computation games have been developed. These games aim to provide both entertainment for the players and useful metadata about music for the game operator. Figure 4 shows screenshots from these games and Table 1 compares various aspects of these games and MajorMiner.

Fig. 2. A screenshot of a game in progress. The player describes a 10 second clip of an unknown song. Italicized descriptions have scored 1 point, red descriptions 0 points, and grey descriptions have scored no points, but will score 2 points if they are subsequently verified through duplication by another player.

Fig. 3. A screenshot of the player's game summary. The artist, album, track, and start time are listed for clips the player has recently seen or scored on. The player can also see their own tags and those of another player.

4 Available at http://www.gwap.com/



The ListenGame5 (Turnbull et al., 2007b) is a communal game in which a group of players simultaneously selects the best and worst words from six choices to describe a 15 second clip from a song in the CAL-250 dataset (Turnbull et al., 2007a), a set of 250 popular songs from many genres. Players are rewarded in proportion to the number of other users agreeing with their choices. The six choices all come from the same category in a pre-defined taxonomy of tags, e.g. instrumentation, usage, genre. The game also includes a bonus round where players can suggest a new word to be added to a particular category.

In Tag a Tune6 (Law et al., 2007), paired players have to determine whether they are listening to the same song or to different songs. The players enter unconstrained words describing a 30 second clip until they can decide if they are describing the same song. Players are rewarded with a certain amount of points for deciding correctly and with more points for making consecutive correct choices. The songs come from the Magnatune collection of Creative Commons licensed music and can be downloaded after the game. The game also includes a bonus round where players score points by agreeing on which of three clips is least similar to the others.

The players in MoodSwings7 (Kim et al., 2008) are also paired, and each traces a trajectory through a continuous two-dimensional space of musical emotion as they listen to the same 30 second clip. The players score points for agreeing with one another, and are awarded more points when their partner moves to agree with them, but their partner's position is only revealed intermittently to avoid biasing the trajectory. As playback of the clip progresses, the two players need to agree more closely to continue scoring points. Musical emotions are described in terms of Thayer's (1990) valence-arousal space, in which the valence axis describes the music in terms of positive versus negative emotion and the arousal axis describes the music in terms of high- versus low-energy.

2. MajorMiner game design

We designed the MajorMiner game with many goals in mind, but our main goal was to encourage players to describe the music thoroughly. This goal shaped the design of the rules for scoring. Another important goal, which informed the method for introducing new clips into the game, was for the game to be fun for both new and veteran players. Specifically, new players should be able to score points immediately, and veteran players should be rewarded with opportunities to score additional points.

Fig. 4. Screenshots from current human computation games for collecting music metadata: (a) ListenGame, (b) Tag a Tune and (c) MoodSwings.

5 http://www.listengame.org/

6 Available at http://www.gwap.com/

7 http://schubert.ece.drexel.edu/moodswings/



Other, lesser goals inform the architecture and implementation of the game. The first of these was the avoidance of the "cold start" problem. Specifically, players should be able to score points as soon as possible after the launch of the game, and a single player should be able to play any time he or she wants, without the need for others to be playing at the same time. Another was to avoid the possibility of cheating, collusion, or other manipulations of the scoring system or, worse, the data collected. Our final goal was to make the game accessible to as many people as possible, implementing it as a standard web page, without requiring any special plugins, installation, or setup.

While many games team one player with a single partner, ours, in a sense, teams one player with all of the other players who have ever seen a particular clip. When a player is paired with a single cooperator, it is possible that the two players could be at very different skill levels, have different levels of familiarity with the clip under consideration, or even speak different languages, detracting from each player's enjoyment of the game. It is also possible that during times of low usage only one player might be online at a time (although this problem has been addressed in other games by replaying recorded sessions). The non-paired format, on the other hand, allows the most compatible, creative, or expert players to cooperate with each other asynchronously, since an obscure description used by one player will remain available until a second player verifies it. It also provides a more transparent means of introducing new clips into the labelled dataset, as opposed to pretending that a user can score on a new clip when playing against a mostly prerecorded game. These benefits of non-paired games come at the price of vulnerability to asynchronous versions of the attacks that afflict paired games. For example, collusion in a paired game is only possible between players who are paired with each other, but it is possible in a non-paired game between any two players who have seen a particular clip.

2.1 Scoring

The design of the game's scoring rules reflects our main goal: to encourage players to thoroughly describe clips in an original, yet relevant way. In order to foster relevance, players only score points when other players agree with them. In order to encourage originality, players are given more points for being the first to use a particular description on a given clip. Originality is also encouraged by giving zero points for a tag that two other players have already agreed upon.

The first player to use a particular tag on a clip scores two points when it is verified by a second player, who scores one point. Subsequent players do not score any points for repeating that tag. These point allocations (2, 1, 0) need not be fixed and could be changed depending on participation and the rate at which new music is introduced into the game. The number of players who score points by verifying a tag could be increased to inflate overall scoring, and point amounts could also be changed to influence the general style of play. We have found, however, that this simple scoring scheme sufficiently satisfies our goal of encouraging players to be thorough. One concern with this system is that later players could be discouraged if all of the relevant descriptions have already been used by two other players. By carefully choosing when clips are shown to players, however, we can avoid this problem and use the tension created by the scoring rules to inspire originality without inducing frustration.
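To make the scoring rule concrete, here is a minimal sketch of how a submitted tag could be scored under the 2, 1, 0 allocation described above. The data structures and function names are our own illustrative assumptions, not the game's actual implementation.

```python
# Minimal sketch of the 2/1/0 scoring rule; data structures and names are
# illustrative assumptions, not MajorMiner's actual code.
from collections import defaultdict

clip_tags = defaultdict(lambda: defaultdict(list))   # clip -> tag -> ordered list of taggers
scores = defaultdict(int)                             # player -> total points

def submit_tag(player, clip, tag):
    taggers = clip_tags[clip][tag]
    if player in taggers:
        return 0                      # repeating one's own tag never scores
    taggers.append(player)
    if len(taggers) == 2:
        scores[taggers[0]] += 2       # the original tagger is verified: 2 points
        scores[player] += 1           # the verifying player scores 1 point
        return 1
    return 0                          # first use (points pending) or already verified: 0 points

# Example: "harp" scores nothing for its first user until a second player agrees.
submit_tag("alice", 42, "harp")       # 0 points now, 2 pending for alice
submit_tag("mim", 42, "harp")         # mim scores 1, alice scores 2
```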

The game has high-score tables, listing the top 20 scoring players over the past day, the past week, and over all time. The principal payoff of the game may be the satisfaction of reaching some standing in these tables. Including the short-term-based tables gives even new players some chance to see their names in lights.

Table 1. Comparison of current human computation games for collecting music metadata. The top half of the table describes aspects of the games' design and the bottom half describes the data collected so far. VA stands for valence-arousal space, see Section 1.3 for details. Usage data for Tag a Tune was not available at the time of publication.

                 MajorMiner           ListenGame            Tag a Tune               MoodSwings

Music database   (See Table 4)        CAL 250               Magnatune                uspop2002
Synchrony        Asynchronous         Synchronous           Synchronous              Synchronous
Pair/group       Group                Group                 Pair                     Pair
Task             Agree on tags        Pick best/worst tag   Decide if song is same   Move in synchrony
Data collected   Free text            Multiple choice       Free text & decision     VA trajectory

Users            490                  440                   –                        100
Clips labeled    2300                 250                   –                        1000
Data collected   12,000 verif. tags   26,000 choices        –                        1,700 VA traj.
Unique tags      6400                 120                   –                        –

A web-based game for collecting music metadata 155

Downloaded By: [Columbia University] At: 22:09 1 June 2009

Page 7: PLEASE SCROLL DOWN FOR ARTICLE - Columbia …dpwe/pubs/MandelE08-majorminer.pdf · easy, useful, and objective. Participants describe 10 second clips of songs and score points when

2.2 Picking clips

When a player requests a new clip to describe, we have the freedom to choose the clip any way we want. This freedom allows us to meet our second goal of making the game fun for both new and experienced players. In order for a player to immediately score on a clip, another player must already have seen it. We therefore maintain a pool of clips that have been seen at least once and so are ready to be scored on. For new players, we draw clips from this pool to facilitate immediate scoring. For experienced players, we generally draw clips from this pool, but sometimes pick clips that have never been seen in order to introduce them into the pool. While such clips do not allow a player to score immediately, they do offer the opportunity to be the first to use many tags, thus scoring more points when others agree.

While clips must be seen by at least one other person to allow immediate scoring, clips that have been seen by many people are difficult to score on. Since more tags are generally verified for clips that have been seen by more people, it becomes increasingly difficult for players to be original in their descriptions. We alleviate this problem by introducing new clips regularly and preferentially choosing clips that have been seen by fewer players from the pool of scorable clips.

In order for players to transition from new to experienced, we define a continuous parameter g that indicates a player's experience level. On each request for a new clip, an unseen clip is chosen with probability g and a scorable clip is chosen with probability 1 - g. When the website was initially launched, g was defined as the ratio of the number of clips a player had seen to the number of clips anyone had seen. This prevented problems from developing when the pool of scorable clips was small. After many clips were added in this way, this definition of experience became too strict. We have now transitioned to defining an experienced player as one who has listened to more than 100 clips, at which point g reaches its maximum value of 1/3. A brand new player has a g of 0, and g linearly increases up to that maximum as the player labels more clips.
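The clip-selection policy just described can be summarized in a few lines. This is a sketch under our own assumptions about the surrounding data structures (the two pools and a per-clip times_seen counter), not the site's actual code.

```python
# Sketch of the clip-selection policy: pick an unseen clip with probability g,
# otherwise a scorable clip that has been seen by the fewest players.
# The pools and the times_seen attribute are illustrative assumptions.
import random

def experience(clips_heard, max_g=1/3, threshold=100):
    # g grows linearly from 0 to 1/3 over the player's first 100 clips
    return max_g * min(clips_heard, threshold) / threshold

def pick_clip(clips_heard, unseen_pool, scorable_pool):
    g = experience(clips_heard)
    if unseen_pool and random.random() < g:
        return unseen_pool.pop()                           # introduce a brand-new clip
    return min(scorable_pool, key=lambda c: c.times_seen)  # favour rarely seen clips
```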

This scheme for picking clips has a direct impact on the number of times each clip is seen, and hence the overall difficulty of the game. The result, derived in Appendix A, is that at equilibrium, all clips in the scorable pool will have been seen the same number of times, where that number is the reciprocal of the expected value of g over all of the players. Computing the expected value of g from the usage data in Figure 5(b), each clip would have been seen 59 times under the original picking strategy, but will only be seen 7 times under the new strategy.
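A back-of-envelope version of this result (our paraphrase under a steady-state assumption, not the paper's Appendix A derivation): every request adds one listen, while a fraction E[g] of requests adds a new clip to the scorable pool, so after R requests

```latex
\text{listens per clip} \;\approx\; \frac{R}{\mathrm{E}[g]\,R} \;=\; \frac{1}{\mathrm{E}[g]},
```

which is where the quoted figures of 59 and 7 listens per clip come from under the original and new definitions of g.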

2.3 Revealing labels

Seeing other players' descriptions is part of the fun of the game. It also acclimatizes new players to the words that they have a better chance of scoring with. The other responses can only be revealed after a player has finished labelling a given clip, otherwise the integrity of the data and the scoring would be compromised. With this in mind, we designed a way to reveal other players' tags without giving away too much information or creating security vulnerabilities.

We reveal the tags of the first player who has seen a clip, a decision that has many desirable consequences. This person is uniquely identified and remains the same regardless of how many subsequent players may see the clip. The same tags are thus shown to every player who requests them for a clip and repeated requests by the same player will be answered identically, even if others have tagged the clip between requests. The only person who sees a different set of labels is that first tagger, who instead sees the tags of the second tagger (if there is one).
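This reveal rule reduces to a single pass over the clip's taggers in order. Here is a sketch under our own assumed representation (an ordered list of (player, tags) pairs), not the actual implementation:

```python
# Sketch of the reveal rule: show the first tagger's tags to everyone except
# that first tagger, who instead sees the second tagger's tags (if any).
# The clip_taggers representation is an illustrative assumption.
def tags_to_reveal(clip_taggers, requester):
    """clip_taggers: list of (player, tags) pairs in the order the clip was tagged."""
    for player, tags in clip_taggers:
        if player != requester:
            return tags        # earliest tagger other than the requester
    return []                  # nobody else has tagged this clip yet
```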

As described in the previous section, the first player to tag a particular clip is likely to have more experience with the game. Their descriptions are good examples for other players as an experienced player will generally be good at describing clips, will have a better idea of what others are likely to agree with, and will know what sort of formatting to use. Their tags can thus serve as good examples to others.

Fig. 5. Histograms of player data. Y-axis is number of players, x-axis is the specified statistic in logarithmic units. (a) Number of points each player has scored, (b) number of clips each player has listened to, and (c) number of unique tags each player has used.



Also, in order to avoid introducing extra-musical context that might bias the player, we only reveal the name of the artist, album, and track after a clip is finished being labelled. This focuses the player much more on describing the sounds immediately present and less on a preconception of what an artist's music sounds like. It is also interesting to listen to a clip without knowing the artist and then compare the sound to the preconceptions one might have about the artist afterward.

2.4 Strategy

When presented with a new clip, a player does not know which tags have already been applied to it. Trying one of the more popular tags will reveal how many times that tag has been used and thus the approximate number of times the clip has been seen. If the popular tag has never been used or has been used only once, the player can apply other popular tags and collect points relatively easily. If the tag has already been used twice, however, it is likely to be more difficult to score on the clip. The player must then decide whether to be more original or go on to another clip.

This clip-wise strategy leads to two overall strategies. The first is to be as thorough as possible, scoring points both for agreeing with existing tags and by using original tags that will later be verified. By agreeing with existing tags, the thorough player both collects single-point scores and prevents future listeners from scoring on those tags. By using original tags, the thorough player will set up many two-point scores when subsequent players encounter the same clip. The second strategy is to listen to as many clips as possible, trying to use popular tags on clips that haven't been seen before.

While having a large number of clips with popular labels is worthwhile, in-depth analysis is more useful for us. To encourage breadth of description, we could add a cost to listening to clips or to posting tags, which would motivate players to use tags they were more certain would be verified by others. Similarly, we could post the high scores in terms of the number of points scored per clip heard or the number of points scored per tag. These adjustments to the scoring philosophy would encourage players to be more parsimonious with their tagging and listening. We have not yet encountered the shallow, speedy strategy and so have not instituted such measures.

We have guarded against a few possible exploits of the system. The first is collusion between two players, or the same person with two different usernames. Two players could, in theory, communicate their tags for particular clips to each other and score on all of them. We thwart this attack by making it difficult for players to see clips of their choosing and by adding a refractory period between presentations of any particular clip. Since players can only see their most recent clips, we also never refer to clips by an absolute identifier, only by relative positions in the recently seen and recently scored lists, making it more difficult to memorize which clips have been seen. Another potential exploit of the system is an extreme form of the speedy strategy in which a player repeatedly uses the same tag or tags on every clip, regardless of the music. This is easily detected and can be neutralized by disabling the offending account.

3. Data collected

At the time of this paper's writing, the site has been live for 11 months, in which time 489 players have registered. A total of 2308 clips have been labelled, being seen by an average of 6.99 players each, and described with an average of 31.24 tags each, 5.08 of which have been verified. See Table 2 for some of the most frequently used descriptions and Figures 5 and 6 for histograms of some statistics of player and clip data, respectively.

Table 2. The 25 most popular tags in the MajorMiner game. Three measures of tag popularity are provided: the number of clips on which the tag was verified by two players, the total number of times the tag was used, including unverified uses and uses by more than two players, and the number of players who have ever used the tag.

Label          Verified   Used   Players

drums            908      3032     114
guitar           837      3146     176
male             723      2450      95
rock             658      2614     198
electronic       484      1855     127
pop              477      1755     148
synth            471      1770      85
bass             417      1632      99
female           342      1387     100
dance            321      1242     115
techno           244       933     101
piano            179       826     120
electronica      167       679      66
vocal            163       824      76
synthesizer      162       681      49
slow             157       726      90
rap              151       723     129
voice            140       735      50
hip hop          139       535      97
jazz             129       696     149
vocals           128       704      50
beat             125       628      76
80s              111       488      69
fast             109       494      70
instrumental     102       536      62



The system was implemented as a web application using the Ruby on Rails framework. The player needs only a browser and the ability to play mp3s, although javascript and flash are helpful and improve the game playing experience. The page and the database are both served from the same Pentium III 733 MHz with 256 MB of RAM. This rather slow server can sustain tens of simultaneous players.

The type of music present in the database affects the labels that are collected, and our music is relatively varied. By genre, it contains electronic music, indie rock, hip hop, pop, country, mainstream contemporary rock, and jazz. Much of the music is from independent or more obscure bands, which diminishes the biases and extra context that come from the recognition of an artist or song. See Table 4, Section 4.1, for the tags that users of Last.fm have applied to this music collection.

Those 2308 clips were selected at random from a collection of 97,060 clips, which exhaustively cover 3880 tracks without overlap. The clips that were selected came from 1441 different tracks on 821 different albums from 489 different artists. This means that the average artist had 4.7 clips tagged, the average album had 2.8 clips tagged, and the average track had 1.6 clips tagged. The most frequently seen items, however, had many more clips tagged. Saint Etienne had 35 of their clips tagged, Kula Shaker's album K had 12 of its clips tagged, and Oasis' track "Better Man" had 9 of its clips tagged.

Certain patterns are observable in the collected descriptions. As can be seen in Table 2, the most popular tags describe genre, instrumentation, and the gender of the singer, if there are vocals. People use descriptive words, like soft, loud, quiet, fast, slow, and repetitive, but do so less frequently. Emotional words are even less popular, perhaps because they are difficult to verbalize in a way that others will likely agree with. There are hardly any words describing rhythm, except for an occasional beat tag.

Since any tag is allowed, players can and do use the names of artists they recognize. For example, u2 has been verified 15 times, depeche mode 12 times, and bowie 8 times. Only five of the clips verified as bowie were actually performed by David Bowie, however; the other three were performed by Gavin Friday, Suede, and Pulp. One need not take these descriptions literally; they could, for instance, be indicating a similarity between the particular clip from Suede's song "New Generation" and some aspect or era of David Bowie's music. These comparisons could indicate artists who are good candidates for the anchors in an anchor space of artist descriptions (Berenzweig et al., 2003). Such a system would describe new artists by their musical relationships to well known artists.

Another valid and easily verified description of a clip is its lyrical content, if decipherable. Ten seconds are generally enough to include a line or two of lyrics, which the player then must distill down to one or two words. This has the added benefit of isolating some of the more important words from the lyrics, since players want to make their descriptions easy for others to match. Currently, love seems to be the most popular lyric word, with 19 verified uses.

The top tags in the game have been quite stable since it began. We would like to believe that this stability results from their being the most appropriate words to describe the music in our database. It is possible, however, that the first players' choices of words had a disproportionate impact on the vocabulary of the game. This might have happened through the feedback inherent in the point scoring system and in the revealing of players' descriptions. If the first players had used different words, those words might now be the most popular. It would not be difficult to divide the players between separate game worlds in order to test this hypothesis, although we do not currently have enough players to attempt this.

Fig. 6. Histograms of clip data. Y-axis is number of clips, x-axis is the specified statistic. (a) Number of tags that have been used by at least two players on each clip, (b) number of unique tags that have been applied to a clip, and (c) number of tags that have been applied to a clip.


3.1 Data normalization

In the initial implementation of the system, tags only matched when they were identical to each other. This was too strict a requirement, as hip hop should match hip-hop in addition to misspellings and other variations in punctuation. Since we still had all of the tagging data, however, it was possible to perform an offline analysis of the tags, i.e. replay the entire history of the game, to compare the use of different matching metrics. Below, we describe the metric that we settled on for the matching in the previously collected data. After experimenting on the existing data, we implemented a similar scheme in the live game website and re-scored all of the previous game-play.

Our offline data analysis consisted of a number of manual and semi-supervised steps. We began with 7698 unique tags. First, we performed a spell check on the collection of tags, in which misspellings were corrected by hand, reducing the total number of tags to 7360. Then, we normalized all instances of &, and, 'n', etc., leaving us with 7288 tags. Next, we stripped out all of the punctuation and spaces, reducing the collection to 6987 tags. And finally, we stemmed the concatenated tag, turning plural forms into singular forms and removing other suffixes, for a final count of 6363 tags, a total reduction of 1335 duplicate tags. The first steps generally merged many unpopular tags with one popular tag, for example all of the misspellings of synthesizer. The later steps tended to merge two popular tags together, for example synth and synthesizer. Merging two popular tags generally affects the scoring of the game overall, while merging orphans improves the game playing experience by making the game more forgiving.
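The automatic steps of this pipeline might look roughly like the following sketch (the spell check was done by hand, and the specific stemmer here is our assumption, not necessarily the one the authors used):

```python
# Sketch of the automatic tag-normalization steps described above: unify
# "&"/"and"/"'n'", strip punctuation and spaces, then stem. The Porter stemmer
# is an assumption; the paper's spell check was performed by hand.
import re
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def normalize_tag(tag):
    tag = tag.lower().strip()
    tag = re.sub(r"\s*(&|'n'|\bn\b)\s*", " and ", tag)   # drum & bass -> drum and bass
    tag = re.sub(r"[^a-z0-9]", "", tag)                  # hip-hop, hip hop -> hiphop
    return stemmer.stem(tag)                             # vocals -> vocal

assert normalize_tag("hip-hop") == normalize_tag("hip hop")
assert normalize_tag("vocals") == normalize_tag("vocal")
```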

Because of the way the rules of the game are defined, merging tags affects the score for a particular clip, increasing it when a single player used each version, but possibly decreasing it if both versions had already been verified. Not only does merging tags affect the score on a particular clip, but it also affects the total number of matching tags in the game. A net increase in scoring means that merging two tags was beneficial in uniting players with slightly different vocabularies, while a net decrease in scoring means that players might have been using two tags to try to increase their scores without conveying new information. See Table 3 for examples of the tags that were affected the most by merging and how much their merging affected the overall number of verified tags.

4. Autotagging experiments

Autotagging is the process of automatically applying relevant tags to a musical excerpt. While the system of Eck et al. (2008) selects the best k tags to describe a clip, we pose the problem as the independent classification of the appropriateness of each potential tag. Each tag is generally only applied to a small fraction of the clips, so in this version of autotagging there are many more negative than positive examples. Since the support vector machines (SVMs) we use are more effective when both classes are equally represented in the training data, we randomly select only enough negative examples to balance the positive examples. For testing, we similarly balance the classes to provide a fixed baseline classification accuracy of 0.5 for random guessing.

To compare different sets of tags, we use the system described by Mandel and Ellis (2008), which was submitted to the MIREX evaluations (Downie et al., 2005) on music classification in 2007. While the results of such experiments will certainly vary with the details of the features and classifier used, and many music classification systems have been described in the literature (e.g. Turnbull et al., 2006; Lidy et al., 2007; Pachet & Roy, 2007; Eck et al., 2008), these experiments are meant to provide a lower bound on the amount of useful information any autotagging system could extract from the data. Furthermore, our system is state of the art, having achieved the highest accuracy in the MIREX 2005 audio artist identification task and performed among the best in the MIREX 2007 audio classification tasks.

The system primarily uses the spectral features from Mandel and Ellis (2005), but also uses temporal features that describe the rhythmic content of each of the clips. For each tag, these features were classified with a binary support vector machine using a radial basis function kernel. The performance of the classifiers was measured using 3-fold cross-validation, in which 2/3 of the data was used for training and 1/3 for testing.

Table 3. Number of verified uses of each tag before and after merging slightly different tags with each other. Merging tags can lead to a reduction in verified uses when several separately-verified variants of a term are merged, leaving just a single verified use.

Tags                                 Initial          Merged   Net change

vocal, vocals                        163 + 128          355        +64
hip hop, hiphop, hip-hop             139 + 29 + 22      166        +24
drum and bass, drum n bass, ...      16 + 6 + ...        48       *+24
beat, beats                          125 + 9            154        +20
horn, horns                          27 + 15             48        +15
drum, drums                          908 + 54           962          0
80s, 80's                            111 + 28           130         −9
synth, synthesizer, synthesizers     471 + 162 + 70     498       −205


Since some of the tags had relatively few positive examples, we repeated the cross-validation experiment for 3 different random divisions of the data, increasing the total number of evaluations to 9. We ensured the uniform size of all training and test sets for classifiers that were being compared to each other. When there were more positive or negative clips than we needed for a particular tag, the desired number was selected uniformly at random.
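The per-tag evaluation protocol amounts to balancing positives and negatives and cross-validating an RBF-kernel SVM. The sketch below uses scikit-learn and assumes clip features have already been computed elsewhere; it illustrates the protocol, not the authors' actual feature extraction or code.

```python
# Sketch of the balanced, per-tag evaluation: sample equal numbers of positive
# and negative clips, then 3-fold cross-validate an RBF-kernel SVM.
# X maps clip ids to precomputed feature vectors (an assumption of this sketch).
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def tag_accuracy(X, positive_clips, all_clips, n_examples=None, seed=0):
    rng = np.random.default_rng(seed)
    pos = list(positive_clips)
    neg_pool = [c for c in all_clips if c not in positive_clips]
    n = n_examples or len(pos)
    pos = list(rng.choice(pos, size=n, replace=False))
    neg = list(rng.choice(neg_pool, size=n, replace=False))
    features = np.array([X[c] for c in pos + neg])
    labels = np.array([1] * n + [0] * n)
    return cross_val_score(SVC(kernel="rbf"), features, labels, cv=3).mean()
```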

We used two sets of tags from MajorMiner. The first consists of all of the (clip, tag) pairs that had been verified by at least two people as being appropriate. We call this dataset "MajorMiner verified". The second consists of all of the (clip, tag) pairs that were submitted to MajorMiner. We call this dataset "MajorMiner all". The game is designed to collect the verified dataset, and the inclusion of the complete dataset and its poorer performance shows the usefulness of tracking when tags are verified.

4.1 Social tag data

Using the audioscrobbler web API, we collected all of the social tag data that users of the website Last.fm had applied to MajorMiner's corpus of music. We call these "social" tags because there is a great deal of communal collaboration that goes into applying them, much of which occurs on the social networks of music websites like Last.fm (Eck et al., 2008). Last.fm users can tag artists, albums, and tracks, but not clips. To label a clip with these data, then, we used the tags that had been applied to its source track along with that track's album and artist. Using artist and album tags is necessary because many tracks have not been tagged much or even at all on Last.fm. The music in our corpus was generally well tagged on Last.fm: only 15 out of 489 artists had fewer than 5 tags associated with them, as did 357 of 821 albums, and 535 of 1441 tracks. By combining tags from these three sources, we have at least one tag for 2284 of our 2308 clips. Of these, 1733 clips have at least one track tag, 1684 have at least one album tag, and 2284 have at least one artist tag.

Last.fm supplies tags with a "count" parameter between 0 and 100, representing how much that tag applies to a particular item. While the exact algorithm used to calculate this count is not publicly available, it appears to be similar to a term-frequency, inverse document frequency measure (TF-IDF). We created three different datasets by thresholding these counts at three different values, 25, 50, and 75, which we call Last.fm 25, Last.fm 50, and Last.fm 75, respectively.

Table 4. Top 25 tags describing our music from Last.fm with a "count" of at least 25, 50, or 75. Each tag is given with the number of clips it has been associated with, out of the 2308 total clips.

Last.fm 25         Clips   Last.fm 50          Clips   Last.fm 75          Clips

rock                1229   albums I own          816   albums I own          688
alternative         1178   electronic            765   indie                 552
indie                992   rock                  735   rock                  526
albums I own         935   indie                 729   electronic            481
electronic           871   alternative           507   90s                   341
indie rock           668   90s                   473   alternative           287
90s                  608   pop                   345   pop                   272
pop                  591   britpop               255   britpop               223
electronica          542   female vocalist       244   mistagged artist      187
80s                  517   80s                   226   Hip-Hop               185
british              513   idm                   209   new wave              140
female vocalists     479   new wave              208   female vocalist       138
britpop              377   electronica           199   singer-songwriter     117
alternative rock     354   Hip-Hop               192   classic rock          104
trip-hop             318   indie rock            187   jazz                  103
jazz                 303   mistagged artist      187   post-punk             103
classic rock         302   oldies                157   00s                    98
idm                  276   singer-songwriter     151   80s                    93
downtempo            263   00s                   146   shoegaze               93
trance               238   british               138   trip-hop              92
new wave             229   trip-hop              133   electronica           83
dance                227   post-punk             131   indie rock            76
soul                 216   ambient               125   collected             75
00s                  214   jazz                  122   ambient               70
Hip-Hop              211   collected             121   british               68


We counted a tag as having been applied to a clip when it had at least the threshold count for that clip's artist, album, or track. See Table 4 for a list of the most popular tags from each of these three datasets and the number of clips associated with each. In comparing this to Table 2, it can be seen that the Last.fm and MajorMiner tag sets share many concepts, although they vary in many particulars.
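A sketch of this thresholding step, under our own assumed representation of the Last.fm data (dictionaries mapping tags to counts for the clip's track, album, and artist):

```python
# Sketch of deriving clip-level tag sets from Last.fm "count" values: a tag is
# applied to a clip if its count reaches the threshold for the clip's track,
# album, or artist. The dictionary representation is an illustrative assumption.
def lastfm_clip_tags(track_tags, album_tags, artist_tags, threshold=50):
    tags = set()
    for source in (track_tags, album_tags, artist_tags):
        tags.update(tag for tag, count in source.items() if count >= threshold)
    return tags

# Thresholds of 25, 50, and 75 yield the "Last.fm 25", "Last.fm 50", and
# "Last.fm 75" datasets, respectively.
```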

After normalizing these tags in much the same way as the MajorMiner tags, we were left with 477 unique tags at the 75-count threshold, 751 at the 50-count threshold, and 1224 at the 25-count threshold. The tags from Last.fm are in general genre-related, although there are some non-genre tags like 90s, albums I own, and female vocalist. The tags from MajorMiner contain many genre terms, but also terms about the music's sonic characteristics and its instrumentation.

4.2 Classification with MajorMiner data

Our first experiment measured the effect of varying the amount of training data on classification accuracy. We first evaluated all of the MajorMiner tags that had been verified at least 50 times, sampling 50 clips for those tags that had been verified on more than 50 clips. Since we used a three-way cross-validation, this means that approximately 33 positive examples and 33 negative examples were used for training on each fold. The results can be seen as the smallest markers in Figure 7 and the tags are sorted by their average classification accuracy on this task. Notice that even using 33 positive examples, results are quite good for tags such as rap, house, jazz, and electronica.

In general, classification accuracy was highest for genres and lowest for individual instruments. This makes sense because the spectral features we use in the classifiers describe overall timbre of the sound as opposed to distinct instruments. One exception to this is saxophone, which can be identified quite accurately. This anomaly is explained by the tag's strong correlation with the genre jazz. Tags with intermediate performance include descriptive terms such as dance, distortion, and fast, which are classified relatively well.

As the amount of training data increased, so too did classification accuracy. For tags like male, female, rock, dance, and guitar an increase in training data from 33 to 200 tags improved accuracy by 10-20 percentage points. This trend is also evident, although less pronounced, in less popular tags like slow, 80s, and jazz. No tag performed significantly worse when more training data was available, although some performed no better.

4.3 Direct comparison with social tags

Certain tags were popular in all of the MajorMiner and Last.fm datasets and we can directly compare the accuracy of classifiers trained on the examples from each one. See Figure 8 for the results of such a comparison, in which the tags are sorted by average classification accuracy. Each of these classifiers was trained on 28 positive and 28 negative examples from a particular dataset. For five out of these seven tags the verified MajorMiner tags performed best, but on the rock and country tags the classifiers trained on Last.fm 25 and Last.fm 75, respectively, were approximately 8 percentage points more accurate. The verified MajorMiner tags were always classified more accurately than the complete set of MajorMiner tags. Of the three Last.fm datasets, Last.fm 75 just barely edges out the other two, although their accuracies are quite similar.

This performance difference can be attributed to three sources of variability. The first is the amount that the concept being classified is captured in the features that are input to the classifier. This variation is mainly exhibited as the large differences in performance between tags.

Fig. 7. Mean classification accuracy for the top 30 verified tags from MajorMiner using different amounts of training data. Larger markers indicate larger amounts of training data: 33, 48, 68, 98, 140, 200 positive examples with an equal number of negative examples. The test sets all contained 10 positive and 10 negative examples, which means that for these plots N = 180, and a difference of approximately 0.07 is statistically significant under a binomial model.
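The caption's 0.07 figure is consistent with a simple binomial confidence margin at N = 180 and chance accuracy p = 0.5 (our back-of-envelope reading, not the paper's stated derivation):

```latex
1.96 \sqrt{\frac{p(1-p)}{N}} \;=\; 1.96 \sqrt{\frac{0.25}{180}} \;\approx\; 0.073 .
```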


Some concepts are more closely tied to the sound than others, and of those, some are more closely tied to sonic characteristics that are captured in our features. While this source of variability is present in both MajorMiner and Last.fm data, we believe that MajorMiner's tags are more sonically relevant because they are applied to clips instead of larger musical elements. Also, the Last.fm datasets contain some extra-musical tags like albums I own, which are generally not discernible from the sound.

The second source of variability is inter-subject variability, caused by differences between individuals' conceptions of a tag's meaning. Because of its collaborative nature, MajorMiner promotes agreement on tags. Last.fm also promotes such agreement through its "count" mechanism, which scores the appropriateness of a tag to a particular entity. The difference in performance between classifiers trained on the two MajorMiner datasets shows that verified instances of tags are more easily predicted. A comparison of the Last.fm tags from the three different thresholds shows some evidence that tags with a higher "count" are more easily learned by classifiers, although the overall differences are minor.

The final source of variability is intra-subject variability, caused by the inconsistencies in an individual's conception of a tag's meaning across multiple clips. This source of variability is present in all datasets, although it can be mitigated by using expert labellers, who have better-defined concepts of particular musical descriptors and are thus more consistent in their taggings. Even though no expert knows all areas of music equally well, it should be possible to construct a patchwork of different experts' judgments that is maximally knowledgeable. MajorMiner's non-paired (or one-paired-with-all) game-play allows experts in different types of music to collaborate asynchronously. As long as they use a distinct vocabulary, it will also select their areas of expertise automatically. It is not clear from our data how many such experts have played the game, but we suspect that this phenomenon will emerge with greater participation.

4.4 Overall performance

Finally, we compared the accuracy of classifiers trained on the top 25 tags from each of the datasets. The overall mean accuracy along with its standard deviation can be seen in Table 5. The variance is quite high in those results because it includes inter-tag and intra-tag variation, inter-tag being the larger of the two. For a breakdown of performance by tag and cross-validation fold, see Figure 9, from which it is clear that the variance is generally quite low, although exceptions do exist.

While these tags are by no means uncorrelated with one another, we believe that it is meaningful to average their performance as they are separate tokens that people chose to use to describe musical entities. While many of MajorMiner's players might consider hip hop and rap to be the same thing, they are not perfectly correlated, and certain clips are more heavily tagged with one than the other. For example, while the track shown in Figure 1 might be considered to fall entirely within the hip hop genre, only certain sections of it include rapping. Those sections are particularly heavily tagged rap, while the rest of the song is not.

Overall, the average classification accuracy was highest for the top 25 verified tags from MajorMiner. The fact that verified tags make better training data than all tags supports the reasoning behind our design of the scoring rules.

Table 5. Overall mean and standard deviation of classification accuracy for each dataset on its 25 most prevalent tags. All classifiers were trained on 40 positive and 40 negative examples of each tag.

Dataset               Mean    Std

MajorMiner verified   0.672   0.125
MajorMiner all        0.643   0.109
Last.fm 50            0.626   0.101
Last.fm 25            0.624   0.104
Last.fm 75            0.620   0.111

Fig. 8. Comparison of classification accuracy between datasets for tags that appeared in all of them. In the evaluation, N = 180, so for the accuracies encountered a difference of approximately 0.07 is statistically significant under a binomial model.


Thresholding the "count" of a tag made little difference in the mean accuracy of the Last.fm datasets, although there were differences between the datasets' performances on specific tags. For low thresholds, there are a small number of tags that perform quite well, while most perform poorly. As the threshold is increased, there are fewer stand-out tags and the classification becomes more consistent across tags.

Fig. 9. Classification accuracy for the top 25 tags from each dataset. Dots are individual cross-validation folds, plus signs are means, and the horizontal bars are standard deviations. The vertical grey line indicates chance performance, 0.5. In the evaluation, N = 252, so under a binomial model an accuracy above 0.563 is significantly better than random (0.5).
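As a rough check on the significance figures quoted in the captions of Figures 8 and 9, the following Python sketch uses a normal approximation to the binomial. The sample sizes come from the captions; the approximation, the choice of p of roughly 0.6 for Figure 8, and the function name are ours.

import math

def halfwidth(n, p=0.5, z=1.96):
    """Approximate 95% confidence half-width of a proportion from n Bernoulli trials."""
    return z * math.sqrt(p * (1 - p) / n)

# Figure 9: N = 252 clips per fold; chance is 0.5, so accuracies above roughly
# 0.5 + halfwidth(252) ~ 0.56 are significantly better than random.
print(0.5 + halfwidth(252))

# Figure 8: N = 180; for accuracies around 0.6, differences of roughly
# halfwidth(180, p=0.6) ~ 0.07 are significant.
print(halfwidth(180, p=0.6))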


The top 25 tags for our clips on Last.fm do not include musical characteristics or instruments, but do include some extra-musical tags like albums I own, indie, british, and 90s. Such tags are not well classified by our system because they have less to do with the sound of the music than with the cultural context surrounding it. Exceptions do arise because of correlations in the data; for example, Last.fm's albums I own is highly correlated with rock in the MajorMiner data. Such correlations could help associate tags with no apparent sonic relevance with tags that have a firmer acoustic foundation.

5. Conclusions

We have presented a game for collecting objective, specific descriptions of musical excerpts. Playing the game is, in our estimation, fun, interesting, and thought-provoking. Preliminary data collection has shown that it is useful for gathering relevant, specific data and that players agree on many characteristics of clips of music. Experiments show that these tags are useful for training automatic music description algorithms, more so than social tags from a popular website.

We believe that audio classifiers are more successful at learning the MajorMiner tags because they are more closely tied to the sound of the music itself. Social tags, on the other hand, also include many non-musical factors, especially when applied to artists and albums. Even though their tags are noisier, social music websites like Last.fm have proven very popular and have amassed billions of taggings, more than a game like MajorMiner could ever hope to collect. This trade-off between tag quality and quantity implies that the two sources of data are useful in different circumstances. For example, they could be combined in a hybrid approach that builds noise robustness on the high-quality data before exploring the noisy, high-quantity data.

There is much that still remains to be done with this system. Among other things, we would like to investigate ways to combine audio-based and word-based music similarity to help improve both, to use automatic descriptions as features for further manipulation, to investigate an anchor space built from the data collected here, and to use descriptions of clips to help determine the structure of songs.

Acknowledgements

The authors would like to thank Johanna Devaney for her help and Douglas Turnbull, Youngmoo Kim, and Edith Law for information about their games. This work was supported by the Fu Foundation School of Engineering and Applied Science via a Presidential Fellowship, by the Columbia Academic Quality Fund, and by the National Science Foundation (NSF) under Grants IIS-0238301 and IIS-0713334. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the NSF.


Appendix A: Derivation of clip seen distribution

Section 2.2 sets out a simple rule for choosing clips to present to a player: with probability g, a clip that no other player has seen is selected, and with probability 1 - g, the scorable clip that has been seen by the fewest other players is selected. The way in which g grows with user experience induces a distribution over the number of times a clip will be seen. In designing this growth, it is useful to have a model of this relationship, and we can create this model from a probabilistic formulation of user experience and past usage data from the game.
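A minimal Python sketch of this selection rule follows, with invented names and data structures; it treats every previously seen clip as scorable and ignores the refractory period discussed below.

import random

def pick_clip(unseen, seen_counts, g):
    """With probability g serve an unseen clip; otherwise serve the clip seen by the fewest players."""
    if unseen and (not seen_counts or random.random() < g):
        return unseen.pop()                        # a clip no player has seen yet
    return min(seen_counts, key=seen_counts.get)   # least-seen clip so far

unseen = [f"clip_{i:03d}" for i in range(100)]
seen_counts = {}
for _ in range(200):
    clip = pick_clip(unseen, seen_counts, g=0.3)
    seen_counts[clip] = seen_counts.get(clip, 0) + 1
print(sorted(seen_counts.values(), reverse=True)[:10])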

To perform this calculation, we define a Markov chain with states n ∈ ℕ. Clips move around in this state space such that a clip's state represents the number of users that have seen that clip. Let x_n be the number of clips that are in state n, i.e. the number of clips that have been seen n times. The system begins with an effectively unlimited supply of unseen clips, x_0 = ∞, and all other x_n = 0. When player i requests a clip, a coin is flipped and with probability g_i the player is given a new clip, which moves from state 0 to state 1. Otherwise the player is given the scorable clip that has been seen by the fewest other players, moving it from n* to n* + 1, where n* > 0 is the lowest populated state.

Assume for the moment that all users have the same probability of receiving a new clip, i.e. g_i = g for all i. Then at equilibrium, only two states are occupied, n̄ = ⌊1/g⌋ and n̄ + 1. The occupancy of state n̄ is x_n̄ = N[n̄ - (1 - g)/g] and the occupancy of state n̄ + 1 is x_{n̄+1} = N - x_n̄. This holds true even when g = E[g_i], where the expectation is taken over the true distribution of g_i values seen in the game. In this case, at equilibrium, n̄ = ⌊1/E[g_i]⌋ and all clips will have been seen either this many times or once more. In quantifying a user's experience level, we are defining the function g_i = g(c_i), where c_i is the number of clips player i has already heard.
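The following simulation sketch (our own, with constant g and ignoring the refractory period discussed below) illustrates this result: after many requests, nearly all clips sit in states ⌊1/g⌋ and ⌊1/g⌋ + 1, and the occupancy of the lower state is close to the closed-form value N[n̄ - (1 - g)/g].

import math
import random
from collections import defaultdict

g = 0.3
x = defaultdict(int)   # x[n] = number of clips that have been seen n times

for _ in range(200000):
    if random.random() < g or not x:
        x[1] += 1                      # a brand-new clip is served, entering state 1
    else:
        n_star = min(x)                # lowest populated state
        x[n_star] -= 1                 # its least-seen clip is served again
        x[n_star + 1] += 1
        if x[n_star] == 0:
            del x[n_star]

N = sum(x.values())
n_bar = math.floor(1 / g)
print(sorted(x.items()))                         # mass concentrates on n_bar and n_bar + 1
print(N * (n_bar - (1 - g) / g), x[n_bar])       # closed-form prediction vs. simulated count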

Because of the refractory period that prevents two people from seeing the same clip in rapid succession, the reality of the game deviates from this idealized model. The analysis does, however, describe the gross effects of these picking strategies.
