Audio Music Similarity and Retrieval: Evaluation Power and Stability

ISMIR 2011, Miami, USA · October 26th · Julián Urbano (@julian_urbano), Diego Martín, Mónica Marrero and Jorge Morato · University Carlos III of Madrid


DESCRIPTION

In this paper we analyze the reliability of the results in the evaluation of Audio Music Similarity and Retrieval systems. We focus on the power and stability of the evaluation, that is, how often a significant difference is found between systems and how often these significant differences are incorrect. We study the effect of using different effectiveness measures with different sets of relevance judgments, for varying numbers of queries and alternative statistical procedures. Different measures are shown to behave similarly overall, though some are much more sensitive and stable than others. The use of different statistical procedures does improve the reliability of the results, and it allows using as few as half the number of queries currently used in MIREX evaluations while still offering very similar reliability levels. We also conclude that experimenters can be very confident that if a significant difference is found between two systems, the difference is indeed real.

TRANSCRIPT

Page 1: Audio Music Similarity and Retrieval: Evaluation Power and Stability

ISMIR 2011, Miami, USA · October 26th. Picture by Michael Shane.

Audio Music Similarity and Retrieval: Evaluation Power and Stability

Julián Urbano (@julian_urbano), Diego Martín, Mónica Marrero and Jorge Morato
University Carlos III of Madrid

Page 2: Audio Music Similarity and Retrieval: Evaluation Power and Stability

AMS: retrieve audio clips musically similar to a query clip

Page 3: Audio Music Similarity and Retrieval: Evaluation Power and Stability

Grand results (MIREX 2009)

Page 4: Audio Music Similarity and Retrieval: Evaluation Power and Stability

Grand results (MIREX 2009). I won!

but the difference is not significant…

yeah, it's not significant!

oh, come on! it's so close!

Page 5: Audio Music Similarity and Retrieval: Evaluation Power and Stability

Grand results (MIREX 2009). I won!

but the difference is not significant…

yeah, it's not significant!

did you hear?

shut up… we are!

oh, come on! it's so close!

Page 6: Audio Music Similarity and Retrieval: Evaluation Power and Stability

Grand results (MIREX 2009). I won!

but the difference is not significant…

yeah, it's not significant!

did you hear?

damn it!

don't worry about it

shut up… we are!

oh, come on! it's so close!

Page 7: Audio Music Similarity and Retrieval: Evaluation Power and Stability

Picture by Sara A. Beyer

what does it mean?

Page 8: Audio Music Similarity and Retrieval: Evaluation Power and Stability

proper interpretation of p-values

H0: mean score of system A = mean score of B

H1: mean scores are different

(figure: score distributions of systems B and A)

a statistical test returns p<0.01, so we conclude A >> B

Page 9: Audio Music Similarity and Retrieval: Evaluation Power and Stability

proper interpretation of p-values

H0: mean score of system A = mean score of B

H1: mean scores are different

(figure: score distributions of systems B and A) it means that if we assume H0 and repeat the experiment, there is a <0.01 probability of having this result again*

*or one even more extreme

a statistical test returns p<0.01, so we conclude A >> B
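To make the interpretation concrete, here is a small illustrative simulation (not from the paper, and the score distribution is made up): when H0 is true, only about 1% of repeated experiments come out with p < 0.01.

```python
import numpy as np
from scipy import stats

# Two "systems" whose per-query scores come from the same distribution, so H0 is true.
rng = np.random.default_rng(42)
n_queries, n_trials = 100, 10_000

false_alarms = 0
for _ in range(n_trials):
    a = rng.normal(0.5, 0.1, n_queries)  # scores of system A on each query
    b = rng.normal(0.5, 0.1, n_queries)  # scores of system B: same distribution
    _, p = stats.ttest_rel(a, b)         # paired t-test over the same queries
    false_alarms += p < 0.01

# Roughly 1% of the repetitions yield p < 0.01: the Type I error rate.
print(false_alarms / n_trials)
```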

Page 10: Audio Music Similarity and Retrieval: Evaluation Power and Stability

MIREX 2009 → MIREX 2010: conclusions about general behavior

A > B: A is better than B, but the difference is not statistically significant. With a different collection we can expect anything (A ? B): this evaluation is not powerful.

A >> B: A is better than B, and the difference is statistically significant: this one is powerful… With a different collection we expect the same, A significantly better than B (A >> B), …and stable. But these could also happen:

A > B: lack of power in MIREX 2010
A < B: minor stability conflict
A << B: major stability conflict

Page 11: Audio Music Similarity and Retrieval: Evaluation Power and Stability

it's all about reliability

Page 12: Audio Music Similarity and Retrieval: Evaluation Power and Stability

on the shoulders of giants (Isaac Newton)

Page 13: Audio Music Similarity and Retrieval: Evaluation Power and Stability

Text REtrieval Conference

1% to 14% of comparisons show stability conflicts

~25% differences to ensure <5% conflicts with 50 queries

[Buckley and Voorhees, 2000]: depends on the measure used

no significance testing

improved reliability with pairwise t-tests

virtually no conflicts if >10% differences with significance

[Sanderson and Zobel, 2005]: others were not as good

with many queries, even significance is unreliable

[Voorhees, 2009]

major review: other collections and more recent measures

some measures are much better than others

[Sakai, 2007]

sensitivity

effort

does not mean they should not be used!

Page 14: Audio Music Similarity and Retrieval: Evaluation Power and Stability

Music Similarity and Retrieval

alternative forms of ground truth for SMS

reliable and comprehensive but too expensive

[Typke et al., 2005] [Urbano et al., 2010]

more about this in 30 mins

no pre-fixed relevance scale

specific measure for the task

[Typke et al., 2006]

despite high agreement, evaluation does change…

agreement between judgments by different people

propose to use more queries

[Jones et al., 2007]

cheaper judgments via crowdsourcing seem reliable

[Urbano et al., 2010] [Lee, 2010]

many other things

[Urbano, 2011]

Page 15: Audio Music Similarity and Retrieval: Evaluation Power and Stability

it's actually about the

effort-reliability tradeoff

Page 16: Audio Music Similarity and Retrieval: Evaluation Power and Stability

it's actually about the

effort-reliability tradeoff

task
# of queries
relevance judgments
measures
statistical methods
# of systems
system similarity

Page 17: Audio Music Similarity and Retrieval: Evaluation Power and Stability

Picture by Wessex Archaeology

measures & judgments

Page 18: Audio Music Similarity and Retrieval: Evaluation Power and Stability

how much information does the user gain?

results as a set

AG@5: Average Gain in the top 5 documents

results as a list

NDCG@5: Normalized Discounted Cumulated Gain

ANDCG@5: Average NDCG across ranks

ADR@5: Average Dynamic Recall

measure used in MIREX (with a different name)

more realistic user model

best documents first, and the lower the rank the lower the gain*

*details in the paper

Page 19: Audio Music Similarity and Retrieval: Evaluation Power and Stability

how much information does a result provide?

BROAD relevance judgments

not similar = 0

somewhat similar = 1

very similar = 2

FINE relevance judgments

real-valued, from 0 to 10 or 100
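As a rough sketch of the two simplest measures (using a common log2 discount for NDCG; the exact formulations in the paper may differ), with gains taken directly from the BROAD or FINE judgments:

```python
import math

def ag_at_k(gains, k=5):
    """Average Gain: mean gain over the top-k results (treated as a set)."""
    return sum(gains[:k]) / k

def ndcg_at_k(gains, k=5, ideal_gains=None):
    """Normalized Discounted Cumulated Gain with a log2 discount.
    ideal_gains should be the best judged gains for the query; for brevity
    the retrieved gains are reused here if it is not given."""
    dcg = sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))
    ideal = sorted(ideal_gains if ideal_gains is not None else gains, reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

# BROAD judgments: not similar = 0, somewhat similar = 1, very similar = 2
print(ag_at_k([2, 1, 0, 2, 1]), ndcg_at_k([2, 1, 0, 2, 1]))

# FINE judgments: real-valued scores, e.g. on a 0-100 scale
print(ag_at_k([87.5, 40.0, 12.0, 95.0, 60.5]), ndcg_at_k([87.5, 40.0, 12.0, 95.0, 60.5]))
```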

Page 20: Audio Music Similarity and Retrieval: Evaluation Power and Stability

look at MIREX 2009

largest evaluation until 2011

Page 21: Audio Music Similarity and Retrieval: Evaluation Power and Stability

Picture by Roger Green

power

Page 22: Audio Music Similarity and Retrieval: Evaluation Power and Stability

% of pairwise comparisons that are significant

what's the effect of:
number of queries

relevance judgments

effectiveness measures

Page 23: Audio Music Similarity and Retrieval: Evaluation Power and Stability

% of pairwise comparisons that are significant

what's the effect of:
number of queries

relevance judgments

effectiveness measures

all 100 queries set

Page 24: Audio Music Similarity and Retrieval: Evaluation Power and Stability

% of pairwise comparisons that are significant

what's the effect of:
number of queries
relevance judgments
effectiveness measures

5 query subset (random sample)
all 100 queries set

Page 25: Audio Music Similarity and Retrieval: Evaluation Power and Stability

% of pairwise comparisons that are significant

what's the effect of:
number of queries
relevance judgments
effectiveness measures

5 query subset (random sample) → evaluation
all 100 queries set
(figure axes: # queries vs. % significant, for Broad and Fine judgments)

Page 26: Audio Music Similarity and Retrieval: Evaluation Power and Stability

% of pairwise comparisons that are significant

what's the effect of:
number of queries
relevance judgments
effectiveness measures

5 query subset (random sample) → evaluation
all 100 queries set
repeat 500 times for 5 query subsets to minimize random effects
52,500 system comparisons
(figure axes: # queries vs. % significant, for Broad and Fine judgments)

Page 27: Audio Music Similarity and Retrieval: Evaluation Power and Stability

% of pairwise comparisons that are significant

what's the effect of:
number of queries
relevance judgments
effectiveness measures

10 query subset → evaluation
all 100 queries set
repeat another 500 times for 10 query subsets
52,500 system comparisons
(figure axes: # queries vs. % significant, for Broad and Fine judgments)

Page 28: Audio Music Similarity and Retrieval: Evaluation Power and Stability

% of pairwise comparisons that are significant

what's the effect of:
number of queries
relevance judgments
effectiveness measures

10 query subset → evaluation
all 100 queries set
stratified random sampling with equal priors
balanced across 10 genres: baroque, blues, classical, country, edance, jazz, metal, rap-hiphop, rock&roll, romantic
(figure axes: # queries vs. % significant, for Broad and Fine judgments)

Page 29: Audio Music Similarity and Retrieval: Evaluation Power and Stability

% of pairwise comparisons that are significant

what's the effect of:
number of queries
relevance judgments
effectiveness measures

all 100 queries set → evaluation
(figure axes: # queries vs. % significant, for Broad and Fine judgments)

Page 30: Audio Music Similarity and Retrieval: Evaluation Power and Stability

we simulate possible

evaluation scenarios
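A sketch of how such an evaluation scenario can be simulated. Here scores (a per-system, per-query effectiveness table) and query_genre are assumed inputs, and a paired Wilcoxon test stands in for whichever significance procedure is applied (MIREX itself uses Friedman plus Tukey's HSD, discussed later).

```python
import random
from itertools import combinations
from scipy import stats

def stratified_subset(queries, query_genre, k):
    """Stratified random sample of k queries, with equal priors per genre."""
    by_genre = {}
    for q in queries:
        by_genre.setdefault(query_genre[q], []).append(q)
    per_genre = k // len(by_genre)
    return [q for qs in by_genre.values() for q in random.sample(qs, per_genre)]

def pct_significant(scores, subset, alpha=0.05):
    """% of system pairs whose difference is significant on the query subset."""
    pairs = list(combinations(scores, 2))
    significant = 0
    for a, b in pairs:
        x = [scores[a][q] for q in subset]
        y = [scores[b][q] for q in subset]
        _, p = stats.wilcoxon(x, y)  # stand-in for the actual test procedure
        significant += p < alpha
    return 100.0 * significant / len(pairs)

# For each subset size, repeat e.g. 500 times to minimize random effects, and
# average pct_significant(scores, stratified_subset(queries, query_genre, k)).
```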

Page 31: Audio Music Similarity and Retrieval: Evaluation Power and Stability

power results (larger is better)

power in MIREX 2009

(figure: % significant comparisons vs. query set size, 40 to 100 queries, for AG, NDCG, ANDCG and ADR; Broad and Fine judgments; y-axis roughly 46-64%)

Page 32: Audio Music Similarity and Retrieval: Evaluation Power and Stability

power results (larger is better)

power in MIREX 2009

similar logarithmic trend, except for ADR with Fine judgments (expected)

(figure: % significant comparisons vs. query set size, 40 to 100 queries, for AG, NDCG, ANDCG and ADR; Broad and Fine judgments)

Page 33: Audio Music Similarity and Retrieval: Evaluation Power and Stability

power results (larger is better)

power in MIREX 2009

similar logarithmic trend, except for ADR with Fine judgments (expected)

same power with 70% of the effort!
only 2 significant pairs missed with 70% effort (probably unstable)

(figure: % significant comparisons vs. query set size, 40 to 100 queries, for AG, NDCG, ANDCG and ADR; Broad and Fine judgments)

Page 34: Audio Music Similarity and Retrieval: Evaluation Power and Stability

merely using more queries

does not pay off when looking for power

Page 35: Audio Music Similarity and Retrieval: Evaluation Power and Stability

Picture by Dave Hunt

stability

Page 36: Audio Music Similarity and Retrieval: Evaluation Power and Stability

% of pairwise comparisons that are conflicting

what's the effect of:
number of queries

relevance judgments

effectiveness measures

Page 37: Audio Music Similarity and Retrieval: Evaluation Power and Stability

% of pairwise comparisons that are conflicting

what's the effect of:
number of queries
relevance judgments
effectiveness measures

5 query subset
all 100 queries set
(balanced across 10 genres: baroque, blues, classical, country, edance, jazz, metal, rap-hiphop, rock&roll, romantic)

Page 38: Audio Music Similarity and Retrieval: Evaluation Power and Stability

% of pairwise comparisons that are conflicting

what's the effect of:
number of queries
relevance judgments
effectiveness measures

two 5 query subsets (independent samples)
all 100 queries set
(balanced across 10 genres: baroque, blues, classical, country, edance, jazz, metal, rap-hiphop, rock&roll, romantic)

Page 39: Audio Music Similarity and Retrieval: Evaluation Power and Stability

% of pairwise comparisons that are conflicting

what's the effect of:
number of queries
relevance judgments
effectiveness measures

two 5 query subsets (independent samples) → two evaluations
all 100 queries set
(figure axes: # queries vs. % conflicting, for Broad and Fine judgments)

Page 40: Audio Music Similarity and Retrieval: Evaluation Power and Stability

% of pairwise comparisons that are conflicting

what's the effect of:
number of queries
relevance judgments
effectiveness measures

two 5 query subsets (independent samples) → two evaluations
all 100 queries set
repeat 500 times to minimize random effects
52,500 cross-collection system comparisons
(figure axes: # queries vs. % conflicting, for Broad and Fine judgments)

Page 41: Audio Music Similarity and Retrieval: Evaluation Power and Stability

% of pairwise comparisons that are conflicting

what's the effect of:
number of queries
relevance judgments
effectiveness measures

two 50 query subsets (independent samples) → two evaluations
with 100 total queries we can't go beyond 50
(figure axes: # queries vs. % conflicting, for Broad and Fine judgments)

Page 42: Audio Music Similarity and Retrieval: Evaluation Power and Stability

we simulate comparisons

across possible collections
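A sketch of the cross-collection comparison, under the same assumptions as before (scores is an assumed per-system, per-query table): two disjoint query subsets simulate two collections, and each system pair that comes out A >> B on the first one is labelled according to the page-10 taxonomy on the second.

```python
import random
from itertools import combinations
from scipy import stats

def verdict(scores, a, b, subset, alpha=0.05):
    """Return '>>', '>', '<' or '<<' for system a vs. system b on a query subset."""
    x = [scores[a][q] for q in subset]
    y = [scores[b][q] for q in subset]
    _, p = stats.wilcoxon(x, y)          # stand-in for the actual test procedure
    a_better = sum(x) >= sum(y)
    if p < alpha:
        return '>>' if a_better else '<<'
    return '>' if a_better else '<'

def cross_collection_conflicts(scores, queries, k):
    first = random.sample(queries, k)
    second = random.sample([q for q in queries if q not in first], k)  # disjoint
    outcomes = []
    for a, b in combinations(scores, 2):
        if verdict(scores, a, b, first) != '>>':
            continue                      # only follow up on significant differences
        v2 = verdict(scores, a, b, second)
        if v2 == '>':
            outcomes.append('lack of power')    # same direction, not significant
        elif v2 == '<':
            outcomes.append('minor conflict')   # reversed, not significant
        elif v2 == '<<':
            outcomes.append('major conflict')   # reversed and significant
    return outcomes
```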

Page 43: Audio Music Similarity and Retrieval: Evaluation Power and Stability

stability results (lower is better)

(figure: % conflicting comparisons vs. query subset size, 5 to 50 queries, for AG, NDCG, ANDCG and ADR; Broad and Fine judgments; y-axis roughly 2-22%)

stability in MIREX 2009

Page 44: Audio Music Similarity and Retrieval: Evaluation Power and Stability

stability results (lower is better)

(figure: % conflicting comparisons vs. query subset size, 5 to 50 queries, for AG, NDCG, ANDCG and ADR; Broad and Fine judgments)

stability in MIREX 2009

lack of power in one collection but not in the other

Page 45: Audio Music Similarity and Retrieval: Evaluation Power and Stability

stability results (lower is better)

(figure: % conflicting comparisons vs. query subset size, 5 to 50 queries, for AG, NDCG, ANDCG and ADR; Broad and Fine judgments)

stability in MIREX 2009

lack of power in one collection but not in the other

ADR takes longer to converge

Page 46: Audio Music Similarity and Retrieval: Evaluation Power and Stability

stability results (lower is better)

(figure: % conflicting comparisons vs. query subset size, 5 to 50 queries, for AG, NDCG, ANDCG and ADR; Broad and Fine judgments)

stability in MIREX 2009

lack of power in one collection but not in the other

converge to <5% for >40 queries (consistent with α=0.05)

ADR takes longer to converge

Page 47: Audio Music Similarity and Retrieval: Evaluation Power and Stability

merely using more queries

does not pay off when looking for stability

Page 48: Audio Music Similarity and Retrieval: Evaluation Power and Stability

type of conflicts (50 queries)

measure     conflicts   A>B (power)   A<B (minor)   A<<B (major)
Broad:
  AG          3.36%       100%          0%            0%
  NDCG        3.77%       99.90%        0.10%         0%
  ANDCG       4.73%       99.96%        0.04%         0%
  ADR         9.03%       99.94%        0.06%         0%
Fine:
  AG          2.64%       99.86%        0.14%         0%
  NDCG        2.94%       99.74%        0.26%         0%
  ANDCG       4.03%       99.91%        0.09%         0%
  ADR        19.08%       99.50%        0.50%         0%

virtually all conflicts due to lack of power in one collection

no major conflicts whatsoever

Page 49: Audio Music Similarity and Retrieval: Evaluation Power and Stability

if significance shows up

it most probably is correct

are we being too conservative?

Page 50: Audio Music Similarity and Retrieval: Evaluation Power and Stability

Milton Friedman

statistics

John Tukey, Frank Wilcoxon

Page 51: Audio Music Similarity and Retrieval: Evaluation Power and Stability

compare two systems

is the difference significant?

t-test, Wilcoxon test, sign test, etc.

significance level α

probability of Type I error

(finding a significant difference when there is none)

usually, α=0.05 or α=0.01

5% or 1% of my significant results are just wrong

stability conflict

they make different assumptions
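For instance, with the per-query scores of two systems (made-up numbers), both tests are one call away in SciPy; which one is appropriate depends on the assumptions one is willing to make about the score differences.

```python
from scipy import stats

# Made-up per-query scores (e.g. AG@5) of two systems on the same queries.
system_a = [0.62, 0.48, 0.71, 0.55, 0.60, 0.43, 0.66, 0.58]
system_b = [0.57, 0.44, 0.69, 0.49, 0.61, 0.40, 0.59, 0.52]

# Paired t-test: assumes roughly normal per-query differences.
_, p_t = stats.ttest_rel(system_a, system_b)

# Wilcoxon signed-rank test: non-parametric, assumes symmetric differences.
_, p_w = stats.wilcoxon(system_a, system_b)

# The difference is called significant if p < alpha (e.g. 0.05 or 0.01).
print(p_t, p_w)
```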

Page 52: Audio Music Similarity and Retrieval: Evaluation Power and Stability

compare several systems

15 systems = 105 comparisons

experiment-wide significance level = 1-(1-α)^105 = 0.995 (with α=0.05)

we can expect at least one significant comparison to be wrong

instead, compare all systems at once

ANOVA, Friedman test, Kruskal-Wallis, etc.

correct p-values to keep experiment-wide significance level <0.05

Tukey’s HSD, Bonferroni, Scheffe, Duncan, Newman-Keuls, etc.

used in MIREX (with different assumptions)

MIREX 2009

Page 53: Audio Music Similarity and Retrieval: Evaluation Power and Stability

more stability

at the cost of

less power

is it worth it?

Page 54: Audio Music Similarity and Retrieval: Evaluation Power and Stability

what a MIREX participant wants

compare my system with the other 14

comparisons between those 14 are uninteresting

subexperiment: only 14 pairwise comparisons, not 105

get back the power missed by considering the other 91

compare all systems with 1-tailed Wilcoxon tests at α=0.01

experiment-wide significance level = 1-(1-0.01)^105 = 0.652

subexperiment-wide significance level = 1-(1-0.01)^14 = 0.131

should throw out more conflicts too
number of comparisons grows linearly with the number of systems

subexperiment-wide significance level = 1-(1-α)^14 = 0.512 (with α=0.05)
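The arithmetic behind these levels, assuming independent comparisons as the formula does:

```python
def experiment_wide(alpha, comparisons):
    """Probability of at least one Type I error across the comparisons."""
    return 1 - (1 - alpha) ** comparisons

print(experiment_wide(0.05, 105))  # ≈ 0.995: all 105 pairs of 15 systems at α=0.05
print(experiment_wide(0.01, 105))  # ≈ 0.652: all 105 pairs at α=0.01
print(experiment_wide(0.01, 14))   # ≈ 0.131: one participant's 14 pairs at α=0.01
print(experiment_wide(0.05, 14))   # ≈ 0.512: those 14 pairs at α=0.05
```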

Page 55: Audio Music Similarity and Retrieval: Evaluation Power and Stability

power results (larger is better)

(figure: % significant comparisons vs. query set size, 40 to 100 queries, for AG, NDCG, ANDCG and ADR; Broad and Fine judgments)

Friedman + Tukey (as in MIREX)

Page 56: Audio Music Similarity and Retrieval: Evaluation Power and Stability

power results (larger is better)

(figure: % significant comparisons vs. query set size, 40 to 100 queries, for AG, NDCG, ANDCG and ADR; Broad and Fine judgments)

Friedman + Tukey (as in MIREX)

all 1-tailed Wilcoxon comparisons: up to 20% more powerful than Friedman + Tukey

Page 57: Audio Music Similarity and Retrieval: Evaluation Power and Stability

power results (larger is better)

(figure: % significant comparisons vs. query set size, 40 to 100 queries, for AG, NDCG, ANDCG and ADR; Broad and Fine judgments)

Friedman + Tukey (as in MIREX)

all 1-tailed Wilcoxon comparisons: up to 20% more powerful than Friedman + Tukey

same power with 50% of the effort!

Page 58: Audio Music Similarity and Retrieval: Evaluation Power and Stability

stability results (lower is better)

(figure: % conflicting comparisons vs. query subset size, 5 to 50 queries, for AG, NDCG, ANDCG and ADR; Broad and Fine judgments)

earlier convergence because of the increased power

Page 59: Audio Music Similarity and Retrieval: Evaluation Power and Stability

stability results (lower is better)

(figure: % conflicting comparisons vs. query subset size, 5 to 50 queries, for AG, NDCG, ANDCG and ADR; Broad and Fine judgments)

earlier convergence because of the increased power

AG converges again to 3-4%; (A)NDCG converge to 5-6%

Page 60: Audio Music Similarity and Retrieval: Evaluation Power and Stability

type of conflicts (50 queries)

measure     conflicts   A>B (power)   A<B (minor)   A<<B (major)
Broad:
  AG          3.68%       96.32%        3.68%         0%
  NDCG        5.05%       96.82%        3.18%         0%
  ANDCG       6.08%       96.84%        3.13%         0.03%
  ADR         5.93%       95.12%        4.88%         0%
Fine:
  AG          3.32%       98.34%        1.66%         0%
  NDCG        6.58%       96.61%        3.39%         0%
  ANDCG       6.44%       94.94%        5.06%         0%
  ADR        12.48%       90.58%        9.37%         0.05%

again, conflicts are due to lack of power in one collection

no major conflicts, within known Type III error rates

Page 61: Audio Music Similarity and Retrieval: Evaluation Power and Stability

effort-reliability tradeoff

                 Friedman+Tukey with 100 queries     1-tailed Wilcoxon with 50 queries
measure          power  - conflicts = stable         power  - conflicts = stable
Broad:
  AG             57.14% -  3.64%   = 53.50%          55.10% -  3.68%   = 51.42%
  NDCG           57.14% -  4.08%   = 53.06%          57.01% -  5.05%   = 51.96%
  ANDCG          57.14% -  4.19%   = 52.95%          57.37% -  6.08%   = 51.29%
  ADR            56.19% -  7.13%   = 49.06%          57.30% -  5.93%   = 51.37%
Fine:
  AG             54.29% -  3.20%   = 51.09%          54.31% -  3.32%   = 50.99%
  NDCG           56.19% -  3.04%   = 53.15%          57.56% -  6.58%   = 50.98%
  ANDCG          56.19% -  2.96%   = 53.23%          57.38% -  6.44%   = 50.94%
  ADR            56.19% - 19.97%   = 36.22%          55.03% - 12.48%   = 42.55%

virtually the same reliability with half the effort!

Page 62: Audio Music Similarity and Retrieval: Evaluation Power and Stability

Friedman-Tukey requires

too much effort

Page 63: Audio Music Similarity and Retrieval: Evaluation Power and Stability

my point?

Page 64: Audio Music Similarity and Retrieval: Evaluation Power and Stability

Do not attempt to accomplish greater results

by a greater effort of your little understanding,

but by a greater understanding of your little effort.

Walter Russell

Page 65: Audio Music Similarity and Retrieval: Evaluation Power and Stability

if significance shows up, it most probably is true: at worst, conflicts are due to lack of power

using more and more queries is pointless: too much effort for the small gain in power and stability

using different similarity scales has little effect: using only one is probably just fine

some effectiveness measures are better than others: they should still be used, as they measure different things, but bear in mind their power and stability

some statistical methods are better than others: virtually the same reliability with half the effort

Page 66: Audio Music Similarity and Retrieval: Evaluation Power and Stability

Picture by Ronny Welter

Page 67: Audio Music Similarity and Retrieval: Evaluation Power and Stability

reduce the judging effort: more queries in Symbolic Melodic Similarity

reliable low-cost in-house evaluations and Crowdsourcing

other collections, tasks and measures

deeper evaluation cutoffs: not just the top 5 documents, pay attention to ranking

probably more reliable, and certainly more reusable

effect of the number of systems: especially if developed by the same research group

forget about power and worry about effect size: eventually, significance becomes meaningless

other statistical methods: Multiple Comparisons with a Control (baseline)

Page 68: Audio Music Similarity and Retrieval: Evaluation Power and Stability

guide experimenters in the interpretation of the results and the tradeoff between effort and reliability