Audio Music Similarity and Retrieval: Evaluation Power and Stability

ISMIR 2011, Miami, USA · October 26th · Julián Urbano (@julian_urbano), Diego Martín, Mónica Marrero and Jorge Morato · University Carlos III of Madrid


DESCRIPTION

In this paper we analyze the reliability of the results in the evaluation of Audio Music Similarity and Retrieval systems. We focus on the power and stability of the evaluation, that is, how often a significant difference is found between systems and how often these significant differences are incorrect. We study the effect of using different effectiveness measures with different sets of relevance judgments, for varying numbers of queries and alternative statistical procedures. Different measures are shown to behave similarly overall, though some are much more sensitive and stable than others. The use of different statistical procedures does improve the reliability of the results, and it allows using as few as half the number of queries currently used in MIREX evaluations while still offering very similar reliability levels. We also conclude that experimenters can be very confident that if a significant difference is found between two systems, the difference is indeed real.

TRANSCRIPT

Page 1: Audio Music Similarity and Retrieval: Evaluation Power and Stability

ISMIR 2011, Miami, USA · October 26th. Picture by Michael Shane.

Audio Music Similarity and Retrieval: Evaluation Power and Stability

Julián Urbano (@julian_urbano), Diego Martín, Mónica Marrero and Jorge Morato
University Carlos III of Madrid

Page 2: Audio Music Similarity and Retrieval: Evaluation Power and Stability

AMS: retrieve audio clips musically similar to a query clip

Page 3: Audio Music Similarity and Retrieval: Evaluation Power and Stability

Grand results (MIREX 2009)

Page 4: Audio Music Similarity and Retrieval: Evaluation Power and Stability

Grand results (MIREX 2009). I won!

but the difference is not significant…

yeah, it's not significant!

oh, come on! it's so close!

Page 5: Audio Music Similarity and Retrieval: Evaluation Power and Stability

Grand results (MIREX 2009). I won!

but the difference is not significant…

yeah, it's not significant!

did you hear?

shut up… we are!

oh, come on! it's so close!

Page 6: Audio Music Similarity and Retrieval: Evaluation Power and Stability

Grand results (MIREX 2009). I won!

but the difference is not significant…

yeah, it's not significant!

did you hear?

damn it!

don't worry about it

shut up… we are!

oh, come on! it's so close!

Page 7: Audio Music Similarity and Retrieval: Evaluation Power and Stability

Picture by Sara A. Beyer

what does it mean?

Page 8: Audio Music Similarity and Retrieval: Evaluation Power and Stability

proper interpretation of p-values

H0: mean score of system A = mean score of B

H1: mean scores are different

(figure: score distributions of systems B and A)

a statistical test returns p<0.01, so we conclude A >> B

Page 9: Audio Music Similarity and Retrieval: Evaluation Power and Stability

proper interpretation of p-values

H0: mean score of system A = mean score of B

H1: mean scores are different

(figure: score distributions of systems B and A) it means that if we assume H0 and repeat the experiment, there is a <0.01 probability of having this result again*

*or one even more extreme

a statistical test returns p<0.01, so we conclude A >> B
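To make the interpretation concrete, here is a small illustrative simulation (not from the paper, and the score distribution is made up): when H0 is true, only about 1% of repeated experiments come out with p < 0.01.

```python
import numpy as np
from scipy import stats

# Two "systems" whose per-query scores come from the same distribution, so H0 is true.
rng = np.random.default_rng(42)
n_queries, n_trials = 100, 10_000

false_alarms = 0
for _ in range(n_trials):
    a = rng.normal(0.5, 0.1, n_queries)  # scores of system A on each query
    b = rng.normal(0.5, 0.1, n_queries)  # scores of system B: same distribution
    _, p = stats.ttest_rel(a, b)         # paired t-test over the same queries
    false_alarms += p < 0.01

# Roughly 1% of the repetitions yield p < 0.01: the Type I error rate.
print(false_alarms / n_trials)
```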

Page 10: Audio Music Similarity and Retrieval: Evaluation Power and Stability

MIREX 2009 → MIREX 2010: conclusions about general behavior

A > B: A is better than B, but the difference is not statistically significant. With a different collection we can expect anything (A ? B): this evaluation is not powerful.

A >> B: A is better than B, and the difference is statistically significant: this one is powerful… With a different collection we expect the same, A significantly better than B (A >> B), …and stable. But these could also happen:

A > B: lack of power in MIREX 2010
A < B: minor stability conflict
A << B: major stability conflict

Page 11: Audio Music Similarity and Retrieval: Evaluation Power and Stability

it's all about reliability

Page 12: Audio Music Similarity and Retrieval: Evaluation Power and Stability

on the shoulders of giants (Isaac Newton)

Page 13: Audio Music Similarity and Retrieval: Evaluation Power and Stability

Text REtrieval Conference

1% to 14% of comparisons show stability conflicts

~25% differences to ensure <5% conflicts with 50 queries

[Buckley and Voorhees, 2000]: depends on the measure used

no significance testing

improved reliability with pairwise t-tests

virtually no conflicts if >10% differences with significance

[Sanderson and Zobel, 2005]: others were not as good

with many queries, even significance is unreliable

[Voorhees, 2009]

major review: other collections and more recent measures

some measures are much better than others

[Sakai, 2007]

sensitivity

effort

does not mean they should not be used!

Page 14: Audio Music Similarity and Retrieval: Evaluation Power and Stability

Music Similarity and Retrieval

alternative forms of ground truth for SMS

reliable and comprehensive but too expensive

[Typke et al., 2005] [Urbano et al., 2010]

more about this in 30 mins

no pre-fixed relevance scale

specific measure for the task

[Typke et al., 2006]

despite high agreement, evaluation does change…

agreement between judgments by different people

propose to use more queries

[Jones et al., 2007]

cheaper judgments via crowdsourcing seem reliable

[Urbano et al., 2010] [Lee, 2010]

many other things

[Urbano, 2011]

Page 15: Audio Music Similarity and Retrieval: Evaluation Power and Stability

it's actually about the

effort-reliability tradeoff

Page 16: Audio Music Similarity and Retrieval: Evaluation Power and Stability

it's actually about the

effort-reliability tradeoff

task
# of queries
relevance judgments
measures
statistical methods
# of systems
system similarity

Page 17: Audio Music Similarity and Retrieval: Evaluation Power and Stability

Picture by Wessex Archaeology

measures & judgments

Page 18: Audio Music Similarity and Retrieval: Evaluation Power and Stability

how much information does the user gain?

results as a set

AG@5: Average Gain in the top 5 documents

results as a list

NDCG@5: Normalized Discounted Cumulated Gain

ANDCG@5: Average NDCG across ranks

ADR@5: Average Dynamic Recall

measure used in MIREX (with a different name)

more realistic user model

best documents first, and the lower the rank the lower the gain*

*details in the paper

Page 19: Audio Music Similarity and Retrieval: Evaluation Power and Stability

how much information does a result provide?

BROAD relevance judgments

not similar = 0

somewhat similar = 1

very similar = 2

FINE relevance judgments

real-valued, from 0 to 10 or 100
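As a rough sketch of the two simplest measures (using a common log2 discount for NDCG; the exact formulations in the paper may differ), with gains taken directly from the BROAD or FINE judgments:

```python
import math

def ag_at_k(gains, k=5):
    """Average Gain: mean gain over the top-k results (treated as a set)."""
    return sum(gains[:k]) / k

def ndcg_at_k(gains, k=5, ideal_gains=None):
    """Normalized Discounted Cumulated Gain with a log2 discount.
    ideal_gains should be the best judged gains for the query; for brevity
    the retrieved gains are reused here if it is not given."""
    dcg = sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))
    ideal = sorted(ideal_gains if ideal_gains is not None else gains, reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

# BROAD judgments: not similar = 0, somewhat similar = 1, very similar = 2
print(ag_at_k([2, 1, 0, 2, 1]), ndcg_at_k([2, 1, 0, 2, 1]))

# FINE judgments: real-valued scores, e.g. on a 0-100 scale
print(ag_at_k([87.5, 40.0, 12.0, 95.0, 60.5]), ndcg_at_k([87.5, 40.0, 12.0, 95.0, 60.5]))
```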

Page 20: Audio Music Similarity and Retrieval: Evaluation Power and Stability

look at MIREX 2009

largest evaluation until 2011

Page 21: Audio Music Similarity and Retrieval: Evaluation Power and Stability

Picture by Roger Green

power

Page 22: Audio Music Similarity and Retrieval: Evaluation Power and Stability

% of pairwise comparisons that are significant

what's the effect of:
number of queries

relevance judgments

effectiveness measures

Page 23: Audio Music Similarity and Retrieval: Evaluation Power and Stability

% of pairwise comparisons that are significant

what's the effect of:
number of queries

relevance judgments

effectiveness measures

all 100 queries set

Page 24: Audio Music Similarity and Retrieval: Evaluation Power and Stability

% of pairwise comparisons that are significant

what's the effect of:
number of queries
relevance judgments
effectiveness measures

5 query subset (random sample)
all 100 queries set

Page 25: Audio Music Similarity and Retrieval: Evaluation Power and Stability

% of pairwise comparisons that are significant

what's the effect of:
number of queries
relevance judgments
effectiveness measures

5 query subset (random sample) → evaluation
all 100 queries set
(figure axes: # queries vs. % significant, for Broad and Fine judgments)

Page 26: Audio Music Similarity and Retrieval: Evaluation Power and Stability

% of pairwise comparisons that are significant

what's the effect of:
number of queries
relevance judgments
effectiveness measures

5 query subset (random sample) → evaluation
all 100 queries set
repeat 500 times for 5 query subsets to minimize random effects
52,500 system comparisons
(figure axes: # queries vs. % significant, for Broad and Fine judgments)

Page 27: Audio Music Similarity and Retrieval: Evaluation Power and Stability

% of pairwise comparisons that are significant

what's the effect of:
number of queries
relevance judgments
effectiveness measures

10 query subset → evaluation
all 100 queries set
repeat another 500 times for 10 query subsets
52,500 system comparisons
(figure axes: # queries vs. % significant, for Broad and Fine judgments)

Page 28: Audio Music Similarity and Retrieval: Evaluation Power and Stability

% of pairwise comparisons that are significant

what's the effect of:
number of queries
relevance judgments
effectiveness measures

10 query subset → evaluation
all 100 queries set
stratified random sampling with equal priors
balanced across 10 genres: baroque, blues, classical, country, edance, jazz, metal, rap-hiphop, rock&roll, romantic
(figure axes: # queries vs. % significant, for Broad and Fine judgments)

Page 29: Audio Music Similarity and Retrieval: Evaluation Power and Stability

% of pairwise comparisons that are significant

what's the effect of:
number of queries
relevance judgments
effectiveness measures

all 100 queries set → evaluation
(figure axes: # queries vs. % significant, for Broad and Fine judgments)

Page 30: Audio Music Similarity and Retrieval: Evaluation Power and Stability

we simulate possible

evaluation scenarios
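A sketch of how such an evaluation scenario can be simulated. Here scores (a per-system, per-query effectiveness table) and query_genre are assumed inputs, and a paired Wilcoxon test stands in for whichever significance procedure is applied (MIREX itself uses Friedman plus Tukey's HSD, discussed later).

```python
import random
from itertools import combinations
from scipy import stats

def stratified_subset(queries, query_genre, k):
    """Stratified random sample of k queries, with equal priors per genre."""
    by_genre = {}
    for q in queries:
        by_genre.setdefault(query_genre[q], []).append(q)
    per_genre = k // len(by_genre)
    return [q for qs in by_genre.values() for q in random.sample(qs, per_genre)]

def pct_significant(scores, subset, alpha=0.05):
    """% of system pairs whose difference is significant on the query subset."""
    pairs = list(combinations(scores, 2))
    significant = 0
    for a, b in pairs:
        x = [scores[a][q] for q in subset]
        y = [scores[b][q] for q in subset]
        _, p = stats.wilcoxon(x, y)  # stand-in for the actual test procedure
        significant += p < alpha
    return 100.0 * significant / len(pairs)

# For each subset size, repeat e.g. 500 times to minimize random effects, and
# average pct_significant(scores, stratified_subset(queries, query_genre, k)).
```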

Page 31: Audio Music Similarity and Retrieval: Evaluation Power and Stability

power results (larger is better)

power in MIREX 2009

(figure: % significant comparisons vs. query set size, 40 to 100 queries, for AG, NDCG, ANDCG and ADR; Broad and Fine judgments; y-axis roughly 46-64%)

Page 32: Audio Music Similarity and Retrieval: Evaluation Power and Stability

power results (larger is better)

power in MIREX 2009

similar logarithmic trend, except for ADR with Fine judgments (expected)

(figure: % significant comparisons vs. query set size, 40 to 100 queries, for AG, NDCG, ANDCG and ADR; Broad and Fine judgments)

Page 33: Audio Music Similarity and Retrieval: Evaluation Power and Stability

power results (larger is better)

power in MIREX 2009

similar logarithmic trend, except for ADR with Fine judgments (expected)

same power with 70% of the effort!
only 2 significant pairs missed with 70% effort (probably unstable)

(figure: % significant comparisons vs. query set size, 40 to 100 queries, for AG, NDCG, ANDCG and ADR; Broad and Fine judgments)

Page 34: Audio Music Similarity and Retrieval: Evaluation Power and Stability

merely using more queries

does not pay off when looking for power

Page 35: Audio Music Similarity and Retrieval: Evaluation Power and Stability

Picture by Dave Hunt

stability

Page 36: Audio Music Similarity and Retrieval: Evaluation Power and Stability

% of pairwise comparisons that are conflicting

what's the effect of:
number of queries

relevance judgments

effectiveness measures

Page 37: Audio Music Similarity and Retrieval: Evaluation Power and Stability

% of pairwise comparisons that are conflicting

what's the effect of:
number of queries
relevance judgments
effectiveness measures

5 query subset
all 100 queries set
(balanced across 10 genres: baroque, blues, classical, country, edance, jazz, metal, rap-hiphop, rock&roll, romantic)

Page 38: Audio Music Similarity and Retrieval: Evaluation Power and Stability

% of pairwise comparisons that are conflicting

what's the effect of:
number of queries
relevance judgments
effectiveness measures

two 5 query subsets (independent samples)
all 100 queries set
(balanced across 10 genres: baroque, blues, classical, country, edance, jazz, metal, rap-hiphop, rock&roll, romantic)

Page 39: Audio Music Similarity and Retrieval: Evaluation Power and Stability

% of pairwise comparisons that are conflicting

what's the effect of:
number of queries
relevance judgments
effectiveness measures

two 5 query subsets (independent samples) → two evaluations
all 100 queries set
(figure axes: # queries vs. % conflicting, for Broad and Fine judgments)

Page 40: Audio Music Similarity and Retrieval: Evaluation Power and Stability

% of pairwise comparisons that are conflicting

what's the effect of:
number of queries
relevance judgments
effectiveness measures

two 5 query subsets (independent samples) → two evaluations
all 100 queries set
repeat 500 times to minimize random effects
52,500 cross-collection system comparisons
(figure axes: # queries vs. % conflicting, for Broad and Fine judgments)

Page 41: Audio Music Similarity and Retrieval: Evaluation Power and Stability

% of pairwise comparisons that are conflicting

what's the effect of:
number of queries
relevance judgments
effectiveness measures

two 50 query subsets (independent samples) → two evaluations
with 100 total queries we can't go beyond 50
(figure axes: # queries vs. % conflicting, for Broad and Fine judgments)

Page 42: Audio Music Similarity and Retrieval: Evaluation Power and Stability

we simulate comparisons

across possible collections
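A sketch of the cross-collection comparison, under the same assumptions as before (scores is an assumed per-system, per-query table): two disjoint query subsets simulate two collections, and each system pair that comes out A >> B on the first one is labelled according to the page-10 taxonomy on the second.

```python
import random
from itertools import combinations
from scipy import stats

def verdict(scores, a, b, subset, alpha=0.05):
    """Return '>>', '>', '<' or '<<' for system a vs. system b on a query subset."""
    x = [scores[a][q] for q in subset]
    y = [scores[b][q] for q in subset]
    _, p = stats.wilcoxon(x, y)          # stand-in for the actual test procedure
    a_better = sum(x) >= sum(y)
    if p < alpha:
        return '>>' if a_better else '<<'
    return '>' if a_better else '<'

def cross_collection_conflicts(scores, queries, k):
    first = random.sample(queries, k)
    second = random.sample([q for q in queries if q not in first], k)  # disjoint
    outcomes = []
    for a, b in combinations(scores, 2):
        if verdict(scores, a, b, first) != '>>':
            continue                      # only follow up on significant differences
        v2 = verdict(scores, a, b, second)
        if v2 == '>':
            outcomes.append('lack of power')    # same direction, not significant
        elif v2 == '<':
            outcomes.append('minor conflict')   # reversed, not significant
        elif v2 == '<<':
            outcomes.append('major conflict')   # reversed and significant
    return outcomes
```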

Page 43: Audio Music Similarity and Retrieval: Evaluation Power and Stability

stability results (lower is better)

(figure: % conflicting comparisons vs. query subset size, 5 to 50 queries, for AG, NDCG, ANDCG and ADR; Broad and Fine judgments; y-axis roughly 2-22%)

stability in MIREX 2009

Page 44: Audio Music Similarity and Retrieval: Evaluation Power and Stability

stability results (lower is better)

(figure: % conflicting comparisons vs. query subset size, 5 to 50 queries, for AG, NDCG, ANDCG and ADR; Broad and Fine judgments)

stability in MIREX 2009

lack of power in one collection but not in the other

Page 45: Audio Music Similarity and Retrieval: Evaluation Power and Stability

stability results (lower is better)

(figure: % conflicting comparisons vs. query subset size, 5 to 50 queries, for AG, NDCG, ANDCG and ADR; Broad and Fine judgments)

stability in MIREX 2009

lack of power in one collection but not in the other

ADR takes longer to converge

Page 46: Audio Music Similarity and Retrieval: Evaluation Power and Stability

stability results (lower is better)

(figure: % conflicting comparisons vs. query subset size, 5 to 50 queries, for AG, NDCG, ANDCG and ADR; Broad and Fine judgments)

stability in MIREX 2009

lack of power in one collection but not in the other

converge to <5% for >40 queries (consistent with α=0.05)

ADR takes longer to converge

Page 47: Audio Music Similarity and Retrieval: Evaluation Power and Stability

merely using more queries

does not pay off when looking for stability

Page 48: Audio Music Similarity and Retrieval: Evaluation Power and Stability

type of conflicts (50 queries)

measure     conflicts   A>B (power)   A<B (minor)   A<<B (major)
Broad:
  AG          3.36%       100%          0%            0%
  NDCG        3.77%       99.90%        0.10%         0%
  ANDCG       4.73%       99.96%        0.04%         0%
  ADR         9.03%       99.94%        0.06%         0%
Fine:
  AG          2.64%       99.86%        0.14%         0%
  NDCG        2.94%       99.74%        0.26%         0%
  ANDCG       4.03%       99.91%        0.09%         0%
  ADR        19.08%       99.50%        0.50%         0%

virtually all conflicts due to lack of power in one collection

no major conflicts whatsoever

Page 49: Audio Music Similarity and Retrieval: Evaluation Power and Stability

if significance shows up

it most probably is correct

are we being too conservative?

Page 50: Audio Music Similarity and Retrieval: Evaluation Power and Stability

Milton Friedman

statistics

John Tukey, Frank Wilcoxon

Page 51: Audio Music Similarity and Retrieval: Evaluation Power and Stability

compare two systems

is the difference significant?

t-test, Wilcoxon test, sign test, etc.

significance level α

probability of Type I error

(finding a significant difference when there is none)

usually, α=0.05 or α=0.01

5% or 1% of my significant results are just wrong

stability conflict

they make different assumptions
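For instance, with the per-query scores of two systems (made-up numbers), both tests are one call away in SciPy; which one is appropriate depends on the assumptions one is willing to make about the score differences.

```python
from scipy import stats

# Made-up per-query scores (e.g. AG@5) of two systems on the same queries.
system_a = [0.62, 0.48, 0.71, 0.55, 0.60, 0.43, 0.66, 0.58]
system_b = [0.57, 0.44, 0.69, 0.49, 0.61, 0.40, 0.59, 0.52]

# Paired t-test: assumes roughly normal per-query differences.
_, p_t = stats.ttest_rel(system_a, system_b)

# Wilcoxon signed-rank test: non-parametric, assumes symmetric differences.
_, p_w = stats.wilcoxon(system_a, system_b)

# The difference is called significant if p < alpha (e.g. 0.05 or 0.01).
print(p_t, p_w)
```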

Page 52: Audio Music Similarity and Retrieval: Evaluation Power and Stability

compare several systems

15 systems = 105 comparisons

experiment-wide significance level = 1-(1-α)^105 = 0.995 (with α=0.05)

we can expect at least one significant comparison to be wrong

instead, compare all systems at once

ANOVA, Friedman test, Kruskal-Wallis, etc.

correct p-values to keep experiment-wide significance level <0.05

Tukey’s HSD, Bonferroni, Scheffe, Duncan, Newman-Keuls, etc.

used in MIREX (with different assumptions)

MIREX 2009

Page 53: Audio Music Similarity and Retrieval: Evaluation Power and Stability

more stability

at the cost of

less power

is it worth it?

Page 54: Audio Music Similarity and Retrieval: Evaluation Power and Stability

what a MIREX participant wants

compare my system with the other 14

comparisons between those 14 are uninteresting

subexperiment: only 14 pairwise comparisons, not 105

get back the power missed by considering the other 91

compare all systems with 1-tailed Wilcoxon tests at α=0.01

experiment-wide significance level = 1-(1-0.01)^105 = 0.652

subexperiment-wide significance level = 1-(1-0.01)^14 = 0.131

should throw out more conflicts too
number of comparisons grows linearly with the number of systems

subexperiment-wide significance level = 1-(1-α)^14 = 0.512 (with α=0.05)
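The arithmetic behind these levels, assuming independent comparisons as the formula does:

```python
def experiment_wide(alpha, comparisons):
    """Probability of at least one Type I error across the comparisons."""
    return 1 - (1 - alpha) ** comparisons

print(experiment_wide(0.05, 105))  # ≈ 0.995: all 105 pairs of 15 systems at α=0.05
print(experiment_wide(0.01, 105))  # ≈ 0.652: all 105 pairs at α=0.01
print(experiment_wide(0.01, 14))   # ≈ 0.131: one participant's 14 pairs at α=0.01
print(experiment_wide(0.05, 14))   # ≈ 0.512: those 14 pairs at α=0.05
```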

Page 55: Audio Music Similarity and Retrieval: Evaluation Power and Stability

power results (larger is better)

(figure: % significant comparisons vs. query set size, 40 to 100 queries, for AG, NDCG, ANDCG and ADR; Broad and Fine judgments)

Friedman + Tukey (as in MIREX)

Page 56: Audio Music Similarity and Retrieval: Evaluation Power and Stability

power results (larger is better)

(figure: % significant comparisons vs. query set size, 40 to 100 queries, for AG, NDCG, ANDCG and ADR; Broad and Fine judgments)

Friedman + Tukey (as in MIREX)

all 1-tailed Wilcoxon comparisons: up to 20% more powerful than Friedman + Tukey

Page 57: Audio Music Similarity and Retrieval: Evaluation Power and Stability

power results (larger is better)

(figure: % significant comparisons vs. query set size, 40 to 100 queries, for AG, NDCG, ANDCG and ADR; Broad and Fine judgments)

Friedman + Tukey (as in MIREX)

all 1-tailed Wilcoxon comparisons: up to 20% more powerful than Friedman + Tukey

same power with 50% of the effort!

Page 58: Audio Music Similarity and Retrieval: Evaluation Power and Stability

stability results (lower is better)

(figure: % conflicting comparisons vs. query subset size, 5 to 50 queries, for AG, NDCG, ANDCG and ADR; Broad and Fine judgments)

earlier convergence because of the increased power

Page 59: Audio Music Similarity and Retrieval: Evaluation Power and Stability

stability results (lower is better)

(figure: % conflicting comparisons vs. query subset size, 5 to 50 queries, for AG, NDCG, ANDCG and ADR; Broad and Fine judgments)

earlier convergence because of the increased power

AG converges again to 3-4%; (A)NDCG converge to 5-6%

Page 60: Audio Music Similarity and Retrieval: Evaluation Power and Stability

type of conflicts (50 queries)

measure     conflicts   A>B (power)   A<B (minor)   A<<B (major)
Broad:
  AG          3.68%       96.32%        3.68%         0%
  NDCG        5.05%       96.82%        3.18%         0%
  ANDCG       6.08%       96.84%        3.13%         0.03%
  ADR         5.93%       95.12%        4.88%         0%
Fine:
  AG          3.32%       98.34%        1.66%         0%
  NDCG        6.58%       96.61%        3.39%         0%
  ANDCG       6.44%       94.94%        5.06%         0%
  ADR        12.48%       90.58%        9.37%         0.05%

again, conflicts are due to lack of power in one collection

no major conflicts, within known Type III error rates

Page 61: Audio Music Similarity and Retrieval: Evaluation Power and Stability

effort-reliability tradeoff

                 Friedman+Tukey with 100 queries     1-tailed Wilcoxon with 50 queries
measure          power  - conflicts = stable         power  - conflicts = stable
Broad:
  AG             57.14% -  3.64%   = 53.50%          55.10% -  3.68%   = 51.42%
  NDCG           57.14% -  4.08%   = 53.06%          57.01% -  5.05%   = 51.96%
  ANDCG          57.14% -  4.19%   = 52.95%          57.37% -  6.08%   = 51.29%
  ADR            56.19% -  7.13%   = 49.06%          57.30% -  5.93%   = 51.37%
Fine:
  AG             54.29% -  3.20%   = 51.09%          54.31% -  3.32%   = 50.99%
  NDCG           56.19% -  3.04%   = 53.15%          57.56% -  6.58%   = 50.98%
  ANDCG          56.19% -  2.96%   = 53.23%          57.38% -  6.44%   = 50.94%
  ADR            56.19% - 19.97%   = 36.22%          55.03% - 12.48%   = 42.55%

virtually the same reliability with half the effort!

Page 62: Audio Music Similarity and Retrieval: Evaluation Power and Stability

Friedman-Tukey requires

too much effort

Page 63: Audio Music Similarity and Retrieval: Evaluation Power and Stability

my point?

Page 64: Audio Music Similarity and Retrieval: Evaluation Power and Stability

Do not attempt to accomplish greater results

by a greater effort of your little understanding,

but by a greater understanding of your little effort.

Walter Russell

Page 65: Audio Music Similarity and Retrieval: Evaluation Power and Stability

if significance shows up, it most probably is true: at worst, conflicts are due to lack of power

using more and more queries is pointless: too much effort for the small gain in power and stability

using different similarity scales has little effect: using only one is probably just fine

some effectiveness measures are better than others: they should still be used, as they measure different things, but bear in mind their power and stability

some statistical methods are better than others: virtually the same reliability with half the effort

Page 66: Audio Music Similarity and Retrieval: Evaluation Power and Stability

Picture by Ronny Welter

Page 67: Audio Music Similarity and Retrieval: Evaluation Power and Stability

reduce the judging effort: more queries in Symbolic Melodic Similarity

reliable low-cost in-house evaluations and Crowdsourcing

other collections, tasks and measures

deeper evaluation cutoffs: not just the top 5 documents, pay attention to ranking

probably more reliable, and certainly more reusable

effect of the number of systems: especially if developed by the same research group

forget about power and worry about effect size: eventually, significance becomes meaningless

other statistical methods: Multiple Comparisons with a Control (baseline)

Page 68: Audio Music Similarity and Retrieval: Evaluation Power and Stability

guide experimenters in the interpretation of the results and the tradeoff between effort and reliability