iir 08 ver.1.0

Introduction to Information Retrieval

Chapter 8: Evaluation in IR

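Sources (references)

• IIR site
  – http://www-csli.stanford.edu/~hinrich/information-retrieval-book.html
  – Content equivalent to the book is available online
  – Slides from Stanford CS276 are available online
• Explanatory slides by naoya (Hatena)
  – http://bloghackers.net/~naoya/iir/ppt/
• Supplementary notes by Tatsuo (Y!J Labs)
  – http://chalow.net/clsearch.cgi?cat=IIR
• These slides are largely cut and pasted from the sources above, with my own knowledge and analysis added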


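Key parts of IIR

• For information recommendation, I consider chapters 6, 7, 9, 18 and 19 the most important
  – Chapter 6: scoring (theory-oriented)
  – Chapter 7: scoring (implementation-oriented)
  – Chapter 8: evaluation methods
  – Chapter 9: relevance feedback
  – Chapter 18: implementations that scale (Matrix decompositions, LSI, singular value decomposition, etc.)
  – Chapter 19: PageRank, HITS, etc.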

IIR 08: Table of contents

• 8 Evaluation in information retrieval
  – 8.1 Information retrieval system evaluation
  – 8.2 Standard test collections
  – 8.3 Evaluation of unranked retrieval sets
  – 8.4 Evaluation of ranked retrieval results
  – 8.5 Assessing relevance
    • 8.5.1 Critiques and justifications of the concept of relevance
  – 8.6 A broader perspective: System quality and user utility
    • 8.6.1 System issues
    • 8.6.2 User utility
    • 8.6.3 Refining a deployed system
  – 8.7 Results snippets
  – 8.8 References and further reading

IIR 08 KEYWORDS

• relevance, gold standard = ground truth, information need, development test collections, TREC, precision, recall, accuracy, F measure, precision-recall curve, interpolated precision, eleven-point interpolated average precision, mean average precision (MAP), precision at k, R-precision, break-even point, ROC curve, sensitivity, specificity, cumulative gain, normalized discounted cumulative gain (NDCG), pooling, kappa statistic, marginal, marginal relevance, A/B testing, clickthrough log analysis = clickstream mining, snippet, static summary <-> dynamic summary, text summarization, keyword-in-context (KWIC)

Evaluating search engines


Clear-cut measures

• How fast does it index
  – Number of documents/hour
  – (Average document size)
• How fast does it search
  – Latency as a function of index size
• Expressiveness of query language
  – Ability to express complex information needs
  – Speed on complex queries
• Uncluttered UI
• Is it free?

These are easy to evaluate.


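Measures that are not clear-cut

• User satisfaction (user happiness) needs to be analyzed quantitatively
  – What is user satisfaction?
  – Response speed and index size are factors
  – But clearly, useless answers cannot make a user happy
• Who is the user we want to make happy?
  – Depends on the setting
• Web engine: the user finds what they want, as captured by feedback such as clicks
• eCommerce site: the user buys what they want
  – Is it the end user or the eCommerce site whose satisfaction is measured?
  – Time to purchase, characteristics of those who purchase
• Enterprise (company/govt/academic): user productivity is what matters
  – Time saved: the time spent looking for information
  – Breadth of information (wide search scope, results that are not fixed), secure access, etc.

How to evaluate these is the hard part.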


Happiness: elusive to measure

• Most common proxy: relevance of search results
  – But how do you measure relevance?
• We will detail a methodology here, then examine its issues
• Relevance measurement requires 3 elements:
  1. A benchmark document collection
  2. A benchmark suite of queries
  3. A usually binary assessment of either Relevant or Nonrelevant for each query and each document
• Some work on more-than-binary, but not the standard


Evaluating an IR system

• Note: the information need is translated into a query
• Relevance is assessed relative to the information need, not the query
  – E.g.,
    • Information need: I'm looking for information on whether drinking red wine is more effective at reducing your risk of heart attacks than white wine.
    • Query: wine red white heart attack effective
• ∴ Human-made relevance judgment data is required

query ⊂ information need

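Standard test collections

Cranfield: The pioneer. Too small by today's standards.
TREC: Used at NIST's Text Retrieval Conference. 450 information needs, 1.89 million documents.
GOV2: From NIST. Currently the largest Web collection available for research purposes. 25 million pages.
NTCIR: The Asian counterpart of TREC. Focuses on East Asian languages and cross-language retrieval. Similar in scale to TREC. (Includes "marginal" relevance judgment data.)
CLEF: Concentrates on European languages and cross-language information retrieval.
Reuters: Reuters-21578 and Reuters-RCV1. The most widely used collections for text classification. RCV1 has 806,791 documents.
20 Newsgroups: Articles from 20 Usenet groups. Widely used for text classification. 18,941 articles.

Note: Wikipedia archives are reportedly also used often these days. Others include MovieLens and Netflix.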

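Evaluating search results

IIR-08 summary

• Evaluating unranked search results
  – positive / negative, true / false
  – Precision and Recall
  – A measure of the trade-off between P and R → F measure
• Evaluating ranked search results
  – Precision-Recall curve
    • Interpolated precision
    • A statistic for examining the curve ... 11-point interpolated average precision
  – → A better summary statistic: MAP
  – Cases that MAP handles poorly (Web search etc.) → Precision at top K → R-Precision
  – Others
    • ROC curve
    • NDCG
• Evaluating relevance with respect to the information need
  – kappa statistic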

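Evaluating unranked search results

(Unranked means an absolute 0/1 judgment)

positive/negative -> true/false

• What is predicted
  – positive (p)
  – negative (n)
• Whether the prediction is correct
  – true (t)
  – false (f)

[Figure: Venn diagram of the relevant and retrieved sets. tp = relevant and retrieved; fn = relevant but not retrieved; fp = retrieved but not relevant; tn = neither.]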

Precision and Recall

Precision = tp/(tp+fp) (= tp/p)
How little junk there is in the results

Recall = tp/(tp+fn)
How few relevant documents are missed
Drawback: it can be pushed to 1 simply by retrieving every document

Precision and Recall trade off against each other (Ex. 8.1)
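The two definitions above can be sketched in a few lines of Python. This is only an illustration: the function name `precision_recall` and the toy document IDs are assumptions made for the example, not part of the slides.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query from two sets of document IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)   # relevant and retrieved
    fp = len(retrieved - relevant)   # retrieved but not relevant
    fn = len(relevant - retrieved)   # relevant but missed
    precision = tp / (tp + fp) if retrieved else 0.0
    recall = tp / (tp + fn) if relevant else 0.0
    return precision, recall


# Hypothetical toy collection: 10 documents, 4 of them relevant.
all_docs = set(range(10))
relevant = {0, 1, 2, 3}

print(precision_recall({0, 1, 7}, relevant))  # a small result set
print(precision_recall(all_docs, relevant))   # retrieving everything: recall is 1
```

The second call shows the drawback noted above: returning the whole collection drives recall to 1 while precision falls to the share of relevant documents in the collection.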

Accuracy and Jaccard Index

Accuracy = (tp+tn)/(tp+fp+fn+tn) (= t/(t+f))

Jaccard index = tp/(tp+fp+fn)

When 99.9% of documents are non-relevant, predicting everything as negative already yields a high Accuracy.

Example: if 0.1% of the people tested have cancer, declaring that nobody has cancer is 99.9% accurate.

Each measure has its own advantages and drawbacks.
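The 99.9% example can be checked numerically. The sketch below assumes a hypothetical population of 100,000 people, which is not a figure from the slides; only the 0.1% rate is.

```python
def accuracy(tp, fp, fn, tn):
    """Share of all judgments that are correct."""
    return (tp + tn) / (tp + fp + fn + tn)


def jaccard(tp, fp, fn):
    """Overlap of the predicted-positive and actually-positive sets; tn is ignored."""
    denom = tp + fp + fn
    return tp / denom if denom else 0.0


# Hypothetical population: 100,000 people, 0.1% (100) have the disease.
# Predicting "negative" for everyone gives tp = 0, fp = 0, fn = 100, tn = 99,900.
tp, fp, fn, tn = 0, 0, 100, 99_900
print(accuracy(tp, fp, fn, tn))  # 0.999
print(jaccard(tp, fp, fn))       # 0.0
```

Because the Jaccard index leaves tn out, the do-nothing predictor scores 0 rather than 0.999.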

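F-measure

• The weighted harmonic mean of P and R (a weighted arithmetic mean does not work well)
  – With the arithmetic mean, retrieving every document already gives a score of at least 0.5
• F with β=1 (α=0.5) is the representative F-measure and is called F1

Ex. 8.2, Ex. 8.3, Ex. 8.7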
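For reference, the standard definition that matches the α and β used here is F = 1 / (α/P + (1-α)/R) = (β²+1)PR / (β²P + R), with β² = (1-α)/α. A minimal Python sketch follows; the function name `f_measure` and the sample values are assumptions for illustration.

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (F_beta); beta=1 gives F1."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (b2 + 1) * precision * recall / (b2 * precision + recall)


# Hypothetical case: everything is retrieved, so recall = 1 but precision is tiny.
p, r = 0.001, 1.0
print((p + r) / 2)      # the arithmetic mean stays above 0.5
print(f_measure(p, r))  # F1 stays close to 0
```

Values of β above 1 weight recall more heavily, and values below 1 weight precision more heavily.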


F1 and other averages

[Figure: Combined Measures. x-axis: Precision (Recall fixed at 70%), 0 to 100; y-axis: 0 to 100; curves: Minimum, Maximum, Arithmetic, Geometric, Harmonic.]
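The five curves in this figure can be recomputed with a short sketch. The function name `combined_measures` and the sampled precision values are assumptions; recall is fixed at 0.7 as in the figure.

```python
import math


def combined_measures(p, r):
    """Five ways of combining precision p and recall r (both in (0, 1])."""
    return {
        "minimum": min(p, r),
        "maximum": max(p, r),
        "arithmetic": (p + r) / 2,
        "geometric": math.sqrt(p * r),
        "harmonic": 2 * p * r / (p + r),
    }


recall = 0.7  # fixed, as in the figure
for precision in (0.1, 0.4, 0.7, 1.0):
    m = combined_measures(precision, recall)
    print(precision, {name: round(value, 3) for name, value in m.items()})
```

When precision and recall differ, the harmonic mean sits closer to the minimum than the geometric or arithmetic mean does, which is why it penalizes a one-sided system.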

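Evaluating ranked search results

(Ranked means a relative ordering)

Ranked search results

• Precision, Recall and the F measure are set-based measures → they need to be extended for ranked results
• The set of retrieved documents = the top K search results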


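A precision-recall curve and Interpolated Precision

[Figure: precision-recall curve with the Interpolated Precision (Pinterp) overlaid. x-axis: Recall, 0.0 to 1.0; y-axis: Precision, 0.0 to 1.0.]

When the next document is relevant (true) the curve moves up and to the right; when it is nonrelevant (false) it drops straight down.

Ex. 8.4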
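A minimal sketch of how the curve and its interpolation can be computed, assuming the usual definition that the interpolated precision at recall level r is the highest precision found at any recall greater than or equal to r. The function names and the toy ranking are assumptions for illustration.

```python
def pr_points(ranking, num_relevant):
    """(recall, precision) after each rank; ranking is a list of booleans."""
    points, hits = [], 0
    for k, is_relevant in enumerate(ranking, start=1):
        hits += is_relevant
        points.append((hits / num_relevant, hits / k))
    return points


def interpolated_precision(points, recall_level):
    """Highest precision at any recall >= recall_level (0 if there is none)."""
    return max((p for r, p in points if r >= recall_level), default=0.0)


# Hypothetical ranking: True = relevant, False = nonrelevant; 3 relevant docs exist.
ranking = [True, False, True, False, False, True]
points = pr_points(ranking, num_relevant=3)
for level in (0.0, 0.5, 1.0):
    print(level, interpolated_precision(points, level))
```

Taking the maximum to the right is what removes the sawtooth dips from the raw curve.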


Evaluation

• Graphs are good, but people want summary measures!
  – Precision at fixed retrieval level
    • Precision-at-k: Precision of top k results
    • Perhaps appropriate for most of web search: all people want are good matches on the first one or two results pages
    • But: averages badly and has an arbitrary parameter of k
  – 11-point interpolated average precision
    • The standard measure in the early TREC competitions: you take the precision at 11 levels of recall varying from 0 to 1 by tenths of the documents, using interpolation (the value for 0 is always interpolated!), and average them
    • Evaluates performance at all recall levels

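11-point interpolated average precision

Note: inspect the graph for odd singular points and the like

• The point at Recall = 0 tends to be erratic
• A curve that is monotonically decreasing, or close to it, is preferable

Ex. 8.5, Ex. 8.6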
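A sketch of the 11-point computation for a single ranked list, under the same interpolation rule (maximum precision at any recall at or above the level). The function name and the toy ranking are assumptions; in a full evaluation the per-level values would normally be averaged over all information needs before plotting.

```python
def eleven_point_average(ranking, num_relevant):
    """Return (average, per-level values) of interpolated precision at recall 0.0..1.0."""
    points, hits = [], 0
    for k, is_relevant in enumerate(ranking, start=1):
        hits += is_relevant
        points.append((hits / num_relevant, hits / k))  # (recall, precision)
    levels = [i / 10 for i in range(11)]                # 0.0, 0.1, ..., 1.0
    interpolated = [
        max((p for r, p in points if r >= level), default=0.0)
        for level in levels
    ]
    return sum(interpolated) / len(levels), interpolated


# Hypothetical ranking: True = relevant; 3 relevant documents exist in total.
average, per_level = eleven_point_average(
    [True, False, True, False, False, True], num_relevant=3
)
print(round(average, 3))
print([round(v, 3) for v in per_level])
```

The per-level list is what gets plotted, so it is the place to look for the odd points mentioned in the note above.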

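MAP (Mean Average Precision)

• Q: the set of information needs
• mj: the number of relevant documents for information need j
• Rjk: the set of retrieved results for information need j, from the top down to document k
• Characteristics of MAP
  – Not interpolated
  – Sums values taken at each relevant document rather than at fixed recall levels
  – An average over multiple information needs

Based on the recall axis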
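With the symbols defined here, the standard definition reads MAP(Q) = (1/|Q|) Σj (1/mj) Σk Precision(Rjk), where k runs over the mj relevant documents of information need j. A minimal Python sketch under that definition follows; the function names and the toy judgments are assumptions for illustration.

```python
def average_precision(ranking, num_relevant):
    """Average of Precision(R_jk) over the m_j relevant documents of one need.

    ranking: list of booleans, True where the ranked document is relevant.
    Relevant documents that are never retrieved contribute a precision of 0.
    """
    hits, total = 0, 0.0
    for k, is_relevant in enumerate(ranking, start=1):
        if is_relevant:
            hits += 1
            total += hits / k  # precision at the rank of this relevant document
    return total / num_relevant if num_relevant else 0.0


def mean_average_precision(queries):
    """queries: list of (ranking, num_relevant) pairs, one per information need."""
    return sum(average_precision(r, m) for r, m in queries) / len(queries)


# Hypothetical judgments for two information needs.
queries = [
    ([True, False, True], 2),          # AP = (1/1 + 2/3) / 2
    ([False, True, False, False], 2),  # one relevant doc never retrieved: AP = (1/2) / 2
]
print(mean_average_precision(queries))
```

Each information need gets equal weight in the final mean, regardless of how many relevant documents it has.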

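Precision at K / R-Precision (evaluation at a single point)

• Do we really need to look at everything retrieved, as MAP does?
• In Web search, precision over the top 10 to 30 results is what matters
  – Rather than an average, wouldn't one well-chosen point do? → precision at K, R-Precision
• Precision at K
  – Precision of the set of the top K retrieved results
  – But what is the right value of K? Doesn't it differ from one information need to another?
• R-Precision is Precision at K with K = |Rel| (Rel: the set of relevant documents); this K is the smallest value at which Recall could reach 1
  – Intuitively: "there are five answers; pick the five you think are right"
  – At this value, Precision = Recall
• R-Precision evaluates a single point, yet correlates quite strongly with MAP
• It cannot be computed when |Rel| is unknown

For reference: TREC and similar evaluations use MAP and (non-interpolated) R-precision

Based on user effort

Based on the recall axis

Ex. 8.8, Ex. 8.9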
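Both single-point measures are easy to sketch in Python. The function names and the toy ranking below are assumptions for illustration; ranks beyond the end of the list are counted as nonrelevant.

```python
def precision_at_k(ranking, k):
    """Fraction of the top k results that are relevant (ranking: list of booleans)."""
    return sum(ranking[:k]) / k


def r_precision(ranking, num_relevant):
    """Precision at K = |Rel|; at this cutoff precision equals recall."""
    return precision_at_k(ranking, num_relevant)


# Hypothetical ranking with |Rel| = 5 relevant documents in the collection.
ranking = [True, True, False, True, False, False, True, False, True, False]
num_relevant = 5
rp = r_precision(ranking, num_relevant)
recall_at_r = sum(ranking[:num_relevant]) / num_relevant
print(precision_at_k(ranking, 10), rp, recall_at_r)
```

Precision and recall coincide at the cutoff |Rel| because both divide the same number of hits by the same number, |Rel|.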

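Other measures

ROC curve

• The shape of a Precision / Recall curve varies greatly with the proportion of relevant documents in the collection (so curves for different information needs cannot be compared)
• The ROC curve plots recall on the vertical axis against the false-positive rate ( fp / (fp + tn) ) on the horizontal axis ... the "rate of junk seen"
• It shows how much visible junk must be tolerated for recall to rise
• Unsuited to looking at the top k; suited to looking at the whole ranking

[Figure: ROC curve. x-axis: proportion of non-relevant documents retrieved; y-axis: proportion of relevant documents retrieved.]

How would precision be read off this graph?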
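The points of an ROC curve can be generated by walking down a ranking, as in the sketch below. The function name `roc_points` and the toy ranking are assumptions; the ranking is taken to cover the whole collection so that the curve ends at (1, 1).

```python
def roc_points(ranking, num_relevant, num_nonrelevant):
    """(false-positive rate, recall) after each rank, starting from (0, 0)."""
    points, tp, fp = [(0.0, 0.0)], 0, 0
    for is_relevant in ranking:
        if is_relevant:
            tp += 1  # one more relevant document seen: move up
        else:
            fp += 1  # one more piece of junk seen: move right
        points.append((fp / num_nonrelevant, tp / num_relevant))
    return points


# Hypothetical collection of 6 documents, 3 relevant and 3 nonrelevant.
ranking = [True, False, True, False, False, True]
points = roc_points(ranking, num_relevant=3, num_nonrelevant=3)
for fpr, recall in points:
    print(round(fpr, 3), round(recall, 3))
```

Both axes are normalized by fixed totals, so the shape does not depend on the share of relevant documents in the collection in the way a precision-recall curve does.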

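NDCG (Normalized Discounted Cumulative Gain)

• A measure that takes "marginal" (graded, non-binary) relevance judgments into account
• Used, for example, when relevance is judged by machine learning
• Parameter settings matter
  – k and the base of the logarithm

Based on user effort

For reference: MSN Search Engine is said to use a variant of NDCG

How should the base of the logarithm be set?

I have applied this weighting idea to the ROC curve and evaluated it on MovieLens.

NDCG is not a single, uniquely defined method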
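Since NDCG has several variants, the sketch below picks one common form as an assumption: gain 2^rel - 1, discount log_b(rank + 1), normalized by the same sum over the ideal ordering. The function names, the graded judgments and the assumption that `gains` lists every judged document for the query are all illustrative.

```python
import math


def dcg(gains, k, log_base=2.0):
    """Discounted cumulative gain of the top k graded judgments."""
    return sum((2 ** g - 1) / math.log(i + 1, log_base)
               for i, g in enumerate(gains[:k], start=1))


def ndcg(gains, k, log_base=2.0):
    """DCG divided by the DCG of the ideal (descending) ordering of the same gains."""
    ideal = dcg(sorted(gains, reverse=True), k, log_base)
    return dcg(gains, k, log_base) / ideal if ideal > 0 else 0.0


# Hypothetical graded judgments (0 = nonrelevant ... 3 = highly relevant), in rank order.
gains = [3, 2, 3, 0, 1, 2]
print(round(ndcg(gains, k=6), 4))
print(round(ndcg(gains, k=3), 4))
```

In this particular variant the logarithm base cancels out after normalization, so only k changes the score. The base does matter in variants that leave the first few ranks undiscounted, which is one reason the choice of variant has to be stated alongside any reported NDCG value.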

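Evaluating relevance with respect to the information need

Assessing relevance

• What does "relevant" mean in the first place?
• It is a subjective judgment
• Moreover, a user will not necessarily make exactly the same choice even in the same situation (judgments fluctuate)
• Is the test data truly relevant? → Quantify its quality with a statistical measure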


Kappa measure for inter-judge (dis)agreement

• Kappa measure
  – A measure of agreement between judges
  – Designed for categorical judgments
  – A statistic that corrects for chance agreement
• Kappa = [ P(A) - P(E) ] / [ 1 - P(E) ]
  – P(A): proportion of the time the judges agree
  – P(E): what agreement would be by chance
• Kappa = 0: agreement no better than chance
• Kappa = 1: complete agreement

Kappa Measure: Example (from lecture08...ppt)

Number of docs   Judge 1       Judge 2
300              Relevant      Relevant
70               Nonrelevant   Nonrelevant
20               Relevant      Nonrelevant
10               Nonrelevant   Relevant


Kappa Example

• P(A) = 370/400 = 0.925
• P(nonrelevant) = (10+20+70+70)/800 = 0.2125
• P(relevant) = (10+20+300+300)/800 = 0.7875
• P(E) = 0.2125^2 + 0.7875^2 = 0.665
• Kappa = (0.925 - 0.665)/(1 - 0.665) = 0.776

• Kappa > 0.8: good agreement
• 0.67 < Kappa < 0.8: "tentative conclusions" (Carletta '96)
• Depends on purpose of study
• For >2 judges: average pairwise kappas

Ex. 8.10

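Other topics in Chapter 8 (lighter reading)

• Besides formal measures of search results, the axes along which users judge how pleasant a system is to use
  – Search speed, usability, etc.
  – How to take "marginal" rather than binary judgments into account
• Quantitative evaluation vs. subjective human evaluation
  – A/B testing
    • Splitting users into groups
• Snippets
  – Static / dynamic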


Can we avoid human judgment?

• No
• Makes experimental work hard
  – Especially on a large scale
• In some very specific settings, can use proxies
  – E.g.: for approximate vector space retrieval, we can compare the cosine distance closeness of the closest docs to those found by an approximate retrieval algorithm
• But once we have test collections, we can reuse them (so long as we don't overtrain too badly)


Fine.

• See also
  – Tetsuya Sakai (Toshiba), "よりよい検索システム実現のために：正解の良し悪しを考慮した情報検索評価動向" (Toward better retrieval systems: trends in information retrieval evaluation that account for how good each correct answer is), IPSJ Magazine, Vol. 47, No. 2, Feb. 2006
    • http://voice.fresheye.com/sakai/IPSJ-MGN470211.pdf
