Towards Minimal Test Collections for Evaluation of Audio Music Similarity and Retrieval

DESCRIPTION

Reliable evaluation of Information Retrieval systems requires large amounts of relevance judgments. Making these annotations is quite complex and tedious for many Music Information Retrieval tasks, so performing such evaluations requires too much effort. A low-cost alternative is the application of Minimal Test Collection algorithms, which offer quite reliable results while significantly reducing the annotation effort. The idea is to incrementally select what documents to judge so that we can compute estimates of the effectiveness differences between systems with a certain degree of confidence. In this paper we show a first approach towards its application to the evaluation of the Audio Music Similarity and Retrieval task, run by the annual MIREX evaluation campaign. An analysis with the MIREX 2011 data shows that the judging effort can be reduced to about 35% of the original while still obtaining results with 95% confidence.

TRANSCRIPT

Page 1: Towards Minimal Test Collections for Evaluation of Audio Music Similarity and Retrieval

AdMIRe 2012, Lyon, France · April 17th. Picture by ERdi43 (Wikipedia)

Towards Minimal Test Collections for Evaluation of Audio Music Similarity and Retrieval

@julian_urbano · University Carlos III of Madrid

@m_schedl · Johannes Kepler University

Page 2: Towards Minimal Test Collections for Evaluation of Audio Music Similarity and Retrieval

Problem

evaluation of IR systems is costly

Annotations

time-consuming, expensive, boring

(Bad) Consequence

small and biased test collections, unlikely to change from year to year

Solution

apply low-cost evaluation methodologies

Page 3: Towards Minimal Test Collections for Evaluation of Audio Music Similarity and Retrieval

[Timeline from 1960 to 2011]

ISMIR (2000–today), MIREX (2005–today)

TREC (1992–today), CLEF (2000–today), NTCIR (1999–today)

Cranfield 2 (1962–1966), MEDLARS (1966–1967), SMART (1961–1995)

nearly 2 decades of meta-evaluation in Text IR: a lot of things have happened here, and some good practices were inherited from it

Page 4: Towards Minimal Test Collections for Evaluation of Audio Music Similarity and Retrieval

Minimal Test Collections (MTC) [Carterette et al.]

estimate the ranking of systems with very few judgments (high incompleteness)

Application in Audio Music Similarity (AMS)

dozens of volunteers required by MIREX every year to make thousands of judgments

Year   Teams   Systems   Queries   Results   Judgments   Overlap
2006       5         6        60     1,800       1,629       10%
2007       8        12       100     6,000       4,832       19%
2009       9        15       100     7,500       6,732       10%
2010       5         8       100     4,000       2,737       32%
2011      10        18       100     9,000       6,322       30%

Page 5: Towards Minimal Test Collections for Evaluation of Audio Music Similarity and Retrieval

evaluation with incomplete judgments

Page 6: Towards Minimal Test Collections for Evaluation of Audio Music Similarity and Retrieval

Basic Idea

treat similarity scores as random variables that can be estimated with uncertainty

gain of an arbitrary document: $G_i \sim$ multinomial

$$E[G_i] = \sum_{l \in \mathcal{L}} P(G_i = l) \cdot l$$

where $\mathcal{L}_{BROAD} = \{0, 1, 2\}$ and $\mathcal{L}_{FINE} = \{0, 1, \dots, 100\}$

whenever document i is judged with label l: $E[G_i] = l$ and $Var(G_i) = 0$

*all variance formulas are in the paper
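For illustration (our sketch, not code from the paper): the moments of $G_i$ under the BROAD scale, assuming the uniform prior used later in the experiments; the function name and layout are ours.

```python
# Sketch: moments of a multinomial gain variable under the BROAD scale,
# assuming a uniform prior over the similarity labels.
BROAD = (0, 1, 2)

def gain_moments(probs, labels=BROAD):
    """Return E[Gi] and Var(Gi) for a multinomial gain variable."""
    e = sum(p * l for p, l in zip(probs, labels))
    var = sum(p * l ** 2 for p, l in zip(probs, labels)) - e ** 2
    return e, var

print(gain_moments((1/3, 1/3, 1/3)))  # unjudged: E[Gi] = 1.0, Var(Gi) = 2/3
print(gain_moments((0.0, 0.0, 1.0)))  # judged as 2: E[Gi] = 2.0, Var(Gi) = 0.0
```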

Page 7: Towards Minimal Test Collections for Evaluation of Audio Music Similarity and Retrieval

AG@k is also treated as a random variable

$$E[AG@k] = \frac{1}{k} \sum_{i \in \mathcal{D}} E[G_i] \cdot I(A_i \le k)$$

the sum iterates over all documents $\mathcal{D}$ (in practice, only the top k retrieved); $A_i$ is the ranking at which document i was retrieved
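A minimal sketch of this estimator for one query (our own code and data layout, not the paper's): `ranked` is the system's retrieved list in rank order and `e_gain` maps documents to their current $E[G_i]$.

```python
# Sketch: E[AG@k] for a single query from per-document expected gains.
def expected_ag_at_k(ranked, e_gain, k=5):
    # Only the top k retrieved documents contribute: I(Ai <= k) = 0 below rank k.
    return sum(e_gain[doc] for doc in ranked[:k]) / k

# Two judged documents (gains 2 and 1) plus three unjudged documents at
# their uniform-prior expectation E[Gi] = 1 under the BROAD scale.
e_gain = {"a": 2.0, "b": 1.0, "c": 1.0, "d": 1.0, "e": 1.0}
print(expected_ag_at_k(["a", "b", "c", "d", "e"], e_gain))  # 1.2
```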

Ultimate Goal

compute a good estimate with the least effort

Page 8: Towards Minimal Test Collections for Evaluation of Audio Music Similarity and Retrieval

Comparing Two Systems

$$E[\Delta AG@k] = \frac{1}{k} \sum_{i \in \mathcal{D}} E[G_i] \cdot \left( I(A_i \le k) - I(B_i \le k) \right)$$

what really matters is the sign of the difference

Evaluating Several Queries

$$E[\Delta AG@k] = \frac{1}{|\mathcal{Q}|} \sum_{q \in \mathcal{Q}} E[\Delta AG@k_q]$$

the sum iterates over all queries $\mathcal{Q}$

The Rationale

if $\alpha < P(\Delta AG@k \le 0) < 1 - \alpha$, judge another document; otherwise, stop judging

Page 9: Towards Minimal Test Collections for Evaluation of Audio Music Similarity and Retrieval

Distribution of AG@k

$$P(AG@k = z) := \sum_{\gamma_k \in \Gamma_k} P(AG@k = z \mid \gamma_k) \cdot P(\gamma_k)$$

the sum iterates over all possible assignments $\gamma_k$ of similarity to the top k documents, and ultimately depends on the distribution of $G_i$

Plain English

the ratio of similarity assignments s.t. AG@k = z

For Complex Measures or Large Similarity Scales

run a Monte Carlo simulation
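Such a Monte Carlo simulation could look like the following sketch (ours, assuming a uniform prior over labels; the names are hypothetical):

```python
import random

# Sketch: Monte Carlo approximation of the distribution of AG@5 for one
# query, sampling unjudged gains from a uniform multinomial over the labels.
def sample_ag_at_k(judged_gains, n_unjudged, labels=(0, 1, 2), k=5, trials=10_000):
    samples = []
    for _ in range(trials):
        gains = list(judged_gains) + random.choices(labels, k=n_unjudged)
        samples.append(sum(gains) / k)
    return samples

# P(AG@5 <= z) is then the fraction of simulated similarity assignments <= z.
samples = sample_ag_at_k(judged_gains=[2, 1], n_unjudged=3)
print(sum(s <= 1.0 for s in samples) / len(samples))
```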

Page 10: Towards Minimal Test Collections for Evaluation of Audio Music Similarity and Retrieval

Actually, AG@k is a Special Case

let G be the similarity of the top k documents for all queries

1. take a sample of k documents; mean = X₁ (the AG@k of a single query)

2. take a sample of k documents; mean = X₂

...

Q. take a sample of k documents; mean = X_Q

mean of sample means = X̄ (the mean AG@k over all queries)

Central Limit Theorem

as Q → ∞, X̄ approaches a normal distribution, regardless of the distribution of G
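The effect is easy to check in simulation; in this small sketch (ours), the mean AG@5 over 100 queries is drawn repeatedly and the resulting values are approximately bell-shaped, even though each $G_i$ is uniform multinomial:

```python
import random
import statistics

# Sketch: the mean AG@5 over Q queries is approximately normal by the CLT,
# regardless of the (here uniform multinomial) distribution of G.
def mean_ag(num_queries, k=5, labels=(0, 1, 2)):
    per_query = [sum(random.choices(labels, k=k)) / k for _ in range(num_queries)]
    return statistics.mean(per_query)

means = [mean_ag(num_queries=100) for _ in range(5_000)]
print(statistics.mean(means), statistics.stdev(means))  # centered near E[G] = 1
```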

Page 11: Towards Minimal Test Collections for Evaluation of Audio Music Similarity and Retrieval

AG@k is Normally Distributed

use the normal cumulative distribution function Φ

$$P(\Delta AG@k \le 0) = \Phi\left( \frac{-E[\Delta AG@k]}{\sqrt{Var(\Delta AG@k)}} \right)$$
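Under this normal approximation the confidence computation is a one-liner; a sketch using SciPy, with illustrative numbers of our own:

```python
from math import sqrt
from scipy.stats import norm

# Sketch: probability that A does not outperform B, via the normal CDF.
def p_delta_le_zero(e_delta, var_delta):
    return norm.cdf(-e_delta / sqrt(var_delta))

# Made-up numbers: a positive estimated difference with some variance.
p = p_delta_le_zero(e_delta=0.04, var_delta=0.0005)
print(1 - p)  # confidence that the sign of the estimated difference is correct
```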

[Figure: density of AG@5 under the BROAD scale (x axis from 0 to 2) and under the FINE scale (x axis from 0 to 100).]

Page 12: Towards Minimal Test Collections for Evaluation of Audio Music Similarity and Retrieval

Confidence as a Function of # Judgments

[Figure: confidence in the ranking of systems (y axis, 50 to 100%) as a function of the percent of judgments made (x axis, 0 to 100%).]

what documents should we judge? those that maximize the confidence

once the target confidence is reached we can stop judging, keep judging to be really confident, or waste our time on further judgments

Page 13: Towards Minimal Test Collections for Evaluation of Audio Music Similarity and Retrieval

The Trick

documents retrieved by both systems are useless and there is no need to judge them: whatever $G_i$ is, it is added and then subtracted

Comparing Several Systems

compute a weight $w_i$ for each query-document and judge the document with the largest effect

$w_i$ in the Original MTC

$w_i$ = the largest weight across system pairs, which reduces to the number of system pairs affected by query-document i

Page 14: Towards Minimal Test Collections for Evaluation of Audio Music Similarity and Retrieval

wi Dependent on Confidence

if we are highly confident about a pair of systems, we do not need to judge another of their documents, even if it has the largest weight

$$w_i = \sum_{(A,B) \in \mathcal{S} - \mathcal{R}} (1 - C_{A,B}) \cdot \left( I(A_i \le k) - I(B_i \le k) \right)^2$$

the sum iterates over the system pairs still lacking confidence; each pair's contribution is inversely proportional to its confidence $C_{A,B}$

better results than traditional weights
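A sketch of these confidence-dependent weights with our own data layout: `ranks[s]` maps documents to the rank at which system s retrieved them, and `conf[(a, b)]` holds the current confidence $C_{A,B}$.

```python
from itertools import combinations

# Sketch: confidence-weighted document weight. Pairs we are already
# confident about contribute almost nothing, even if the document would
# affect them; documents retrieved by both or neither systems contribute 0.
def weight(doc, systems, ranks, conf, k=5):
    w = 0.0
    for a, b in combinations(systems, 2):
        d = int(ranks[a].get(doc, k + 1) <= k) - int(ranks[b].get(doc, k + 1) <= k)
        w += (1 - conf[(a, b)]) * d * d
    return w
```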

Page 15: Towards Minimal Test Collections for Evaluation of Audio Music Similarity and Retrieval

MTC for AMS with AG@k

Page 16: Towards Minimal Test Collections for Evaluation of Audio Music Similarity and Retrieval

MTC for ΔAG@k

while $\frac{1}{|\mathcal{S}|} \sum_{(A,B) \in \mathcal{S}} C_{A,B} \le 1 - \alpha$ do   (average confidence on the ranking)

  $i^* \leftarrow \arg\max_i w_i$ over all unjudged query-documents   (select the best document)

  judge query-document $i^*$, obtaining the true $gain_{i^*}$

  $E[G_{i^*}] \leftarrow gain_{i^*}$, $Var(G_{i^*}) \leftarrow 0$   (update: the confidence increases)

end while
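Putting the pieces together, the loop can be sketched as follows; `weight_of`, `confidence`, `update` and `ask_judge` are placeholders for the components sketched on the previous slides and for the human annotator:

```python
# Sketch of the MTC loop for ΔAG@k; all callables are placeholders, and
# ask_judge stands in for the human annotator.
def mtc_loop(unjudged, weight_of, confidence, update, ask_judge, alpha=0.05):
    # Stop once the average confidence over system pairs exceeds 1 - alpha.
    while unjudged and confidence() <= 1 - alpha:
        i_star = max(unjudged, key=weight_of)  # most informative query-document
        unjudged.remove(i_star)
        gain = ask_judge(i_star)               # obtain the true gain
        update(i_star, gain)                   # E[Gi*] <- gain, Var(Gi*) <- 0
```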

Page 17: Towards Minimal Test Collections for Evaluation of Audio Music Similarity and Retrieval

MTC in MIREX AMS 2011

Page 18: Towards Minimal Test Collections for Evaluation of Audio Music Similarity and Retrieval

Why MIREX 2011

largest edition so far: 18 systems (153 pairwise comparisons), 100 queries, and 6,322 judgments

Distribution of Gi

let us work with a uniform distribution for now

Page 19: Towards Minimal Test Collections for Evaluation of Audio Music Similarity and Retrieval

Confidence as Judgments are Made

correct bins: the estimated sign is correct, or the difference is not significant anyway

Page 20: Towards Minimal Test Collections for Evaluation of Audio Music Similarity and Retrieval

Confidence as Judgments are Made

correct bins: the estimated sign is correct, or the difference is not significant anyway

Page 21: Towards Minimal Test Collections for Evaluation of Audio Music Similarity and Retrieval

Confidence as Judgments are Made

correct bins: the estimated sign is correct, or the difference is not significant anyway

Page 22: Towards Minimal Test Collections for Evaluation of Audio Music Similarity and Retrieval

high confidence with considerably less effort

Page 23: Towards Minimal Test Collections for Evaluation of Audio Music Similarity and Retrieval

Accuracy as Judgments are Made

estimated bins: always better than expected

Page 24: Towards Minimal Test Collections for Evaluation of Audio Music Similarity and Retrieval

Accuracy as Judgments are Made

estimated signs highly correlated with confidence

Page 25: Towards Minimal Test Collections for Evaluation of Audio Music Similarity and Retrieval

Accuracy as Judgments are Made

rankings with τ = 0.9 are traditionally considered equivalent (the same as 95% accuracy)

Page 26: Towards Minimal Test Collections for Evaluation of Audio Music Similarity and Retrieval

high confidence and high accuracy with considerably less effort

Page 27: Towards Minimal Test Collections for Evaluation of Audio Music Similarity and Retrieval

Statistical Significance

MTC allows us to accurately estimate the ranking, but only for the current set of queries

can we generalize to the population of queries at large?

Not Trivial

we have the variance of the estimates but not the sample variance

Page 28: Towards Minimal Test Collections for Evaluation of Audio Music Similarity and Retrieval

Work with Upper and Lower Bounds of ΔAG@k

Upper bound: best case for A. Lower bound: best case for B.

$$\Delta AG@k^{+} = \frac{1}{k} \sum_{i \in \pi} G_i \cdot \left( I(A_i \le k) - I(B_i \le k) \right) + \frac{1}{k} \sum_{i \in \bar{\pi}} l^{+} \cdot I(A_i \le k) - \frac{1}{k} \sum_{i \in \bar{\pi}} l^{-} \cdot I(B_i \le k \wedge A_i > k)$$

the first sum runs over the judged documents π, whose judgments are known (*same for the lower bound)

Page 29: Towards Minimal Test Collections for Evaluation of Audio Music Similarity and Retrieval

Work with Upper and Lower Bounds of ΔAG@k

Upper bound: best case for A. Lower bound: best case for B.

$$\Delta AG@k^{+} = \frac{1}{k} \sum_{i \in \pi} G_i \cdot \left( I(A_i \le k) - I(B_i \le k) \right) + \frac{1}{k} \sum_{i \in \bar{\pi}} l^{+} \cdot I(A_i \le k) - \frac{1}{k} \sum_{i \in \bar{\pi}} l^{-} \cdot I(B_i \le k \wedge A_i > k)$$

in the second sum, unjudged documents ($i \in \bar{\pi}$) retrieved by A are assigned the best similarity score $l^{+}$ (*same for the lower bound)

Page 30: Towards Minimal Test Collections for Evaluation of Audio Music Similarity and Retrieval

Work with Upper and Lower Bounds of ΔAG@k

Upper bound: best case for A. Lower bound: best case for B.

$$\Delta AG@k^{+} = \frac{1}{k} \sum_{i \in \pi} G_i \cdot \left( I(A_i \le k) - I(B_i \le k) \right) + \frac{1}{k} \sum_{i \in \bar{\pi}} l^{+} \cdot I(A_i \le k) - \frac{1}{k} \sum_{i \in \bar{\pi}} l^{-} \cdot I(B_i \le k \wedge A_i > k)$$

in the third sum, unjudged documents retrieved by B but not by A are assigned the worst similarity score $l^{-}$ (*same for the lower bound)
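A sketch of the upper bound for one query, following the three terms above; `judged` maps judged documents to their gains, the label extremes default to the BROAD scale, and the lower bound swaps the roles of A and B:

```python
# Sketch: upper bound on ΔAG@k for one query. Unjudged documents retrieved
# only by A get the best label l+; those retrieved only by B get the worst, l-.
def delta_ag_upper(top_a, top_b, judged, k=5, l_best=2, l_worst=0):
    a, b = set(top_a[:k]), set(top_b[:k])
    ub = 0.0
    for d in a | b:
        if d in judged:                   # known judgments
            ub += judged[d] * (int(d in a) - int(d in b))
        elif d in a and d not in b:       # unjudged, retrieved by A only
            ub += l_best
        elif d in b:                      # unjudged, retrieved by B but not A
            ub -= l_worst
        # unjudged documents retrieved by both cancel out ("the trick")
    return ub / k
```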

Page 31: Towards Minimal Test Collections for Evaluation of Audio Music Similarity and Retrieval

3 Rules

1. Assume the best case for A (the upper bound): if even then A <<< B, conclude A <<< B

2. Assume the best case for B (the lower bound): if even then B <<< A, conclude B <<< A

3. If in the best case for A we do not have A >>> B, and in the best case for B we do not have B >>> A, conclude they are not significantly different

Problem

the upper and lower bounds are very unrealistic

Page 32: Towards Minimal Test Collections for Evaluation of Audio Music Similarity and Retrieval

Incorporate a Heuristic

4. If the estimated difference is larger than t, naively conclude significance

Choose t Based on Power Analysis

t = the effect size detectable by a t-test with:
• sample variance σ² = 0.0615 (from previous MIREX editions)
• sample size n = 100
• Type I error rate α = 0.05 (typical value)
• Type II error rate β = 0.15 (typical value)

t ≈ 0.067
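The power analysis can be reproduced with a normal approximation to a one-sided t-test; this is our sketch, and the paper's exact procedure may differ:

```python
from math import sqrt
from scipy.stats import norm

# Sketch: smallest mean difference detectable by a one-sided test, via the
# normal approximation delta = (z_{1-alpha} + z_{1-beta}) * sigma / sqrt(n).
def detectable_difference(var=0.0615, n=100, alpha=0.05, beta=0.15):
    z = norm.ppf(1 - alpha) + norm.ppf(1 - beta)
    return z * sqrt(var / n)

print(detectable_difference())  # ~0.067, the threshold t used by rule 4
```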

Page 33: Towards Minimal Test Collections for Evaluation of Audio Music Similarity and Retrieval

Accuracy of the Significance Estimates

rule 4 (heuristic) ends up overestimating significance

pretty good around 95% confidence

Page 34: Towards Minimal Test Collections for Evaluation of Audio Music Similarity and Retrieval

Accuracy of the Significance Estimates

rule 4 (heuristic) ends up overestimating significance

rules 1 to 3 begin to apply and correct overestimations

Page 35: Towards Minimal Test Collections for Evaluation of Audio Music Similarity and Retrieval

Accuracy of the Significance Estimates

closer to expected, and never under 90%

Page 36: Towards Minimal Test Collections for Evaluation of Audio Music Similarity and Retrieval

significance can be estimated fairly well too

Page 37: Towards Minimal Test Collections for Evaluation of Audio Music Similarity and Retrieval

what we did

Page 38: Towards Minimal Test Collections for Evaluation of Audio Music Similarity and Retrieval

Introduce MTC to the MIR folks

Work out the Math for MTC with AG@k

See How Well it Would Have Done in AMS 2011: quite well, actually!

Page 39: Towards Minimal Test Collections for Evaluation of Audio Music Similarity and Retrieval

what now

Page 40: Towards Minimal Test Collections for Evaluation of Audio Music Similarity and Retrieval

Learn the True Distribution of Similarity Judgments

it is clearly not uniform; a better model would give more accurate estimates with less effort; use previous AMS data, or fit a model as we judge

Significance Testing with Incomplete Judgments

the best-case scenarios are very unrealistic

Study Low-Cost Methodologies for Other MIR Tasks

Page 41: Towards Minimal Test Collections for Evaluation of Audio Music Similarity and Retrieval

what for

Page 42: Towards Minimal Test Collections for Evaluation of Audio Music Similarity and Retrieval

MTC Greatly Reduces the Effort for AMS (and SMS)

have MIREX volunteers incrementally create brand new test collections for other tasks

Better Yet

study low-cost methodologies for the other tasks

Not Only for MIREX

private collections for in-house evaluations, with no possibility of gathering large pools of annotators: low cost becomes paramount

Page 43: Towards Minimal Test Collections for Evaluation of Audio Music Similarity and Retrieval

the MIR community needs a paradigm shift

from a priori to a posteriori evaluation methods

to reduce cost and gain reliability