Affinity Rank
Yi Liu, Benyu Zhang, Zheng Chen
MSRA
Outline: Motivation, Related Work, Model & Algorithm, Evaluation, Conclusion & Future Work
Search for Useful Information
Full-text search
Importance Judgment
Manual compilation
Failure Still Exists
Example – “Spielberg”
Search
Example – “Spielberg” Search (Cont.)
Motivation
Existing problems in IR applications:
- Similar search results dominate the top one or two pages
- Users tire of similar results on the same topic
- Users cannot find what they need among those similar results
Situations where the problem is, or will be, intensified:
- Highly repetitive corpora, e.g. newsgroups, news archives, specialized websites
- Generalized or short queries
Diversity & Informativeness
- Diversity: the coverage of different topics by a group of documents
- Informativeness: the extent to which a document can represent its topic locality (high informativeness: inclusive)
Why? Traditional IR evaluation measures maximize the relevance between the query & results and surface the most important results; to end-users, however, relevant + important ≠ desirable.
A way out: increase the diversity of the top results, and increase the informativeness of each single result.
Basic Idea
- Build a similarity-based link map
- Link analysis → Affinity Rank, indicating the informativeness of each document
- Rank adjustment: only the most informative document of each topic can rank high
- Re-rank with Affinity Rank: more diversified, more informative top results
Related Work – Link Analysis
- Explicit links: PageRank (Page et al., 1998) and HITS (Kleinberg, 1998); the Web author's perspective (subjective)
- Implicit links: DirectHit (http://www.directhit.com) and Small Web Search (Xue et al., 2003); the end-user's perspective (objective)
Related Work – Clustering
Algorithm       | Complexity | Naming
Scatter/Gather* | O(kn)      | Centroid + ranked words
TopCat          | High       | Set of named entities
WBSC*           | O(m² + n)  | Ranked words
STC*            | O(n)       | Sets of N-grams
IF              | O(kn)      | -
PRSA            | O(knm)     | Ranked words
Bipartite       | O(nm)?     | Ranked words
(n: #docs, k: #clusters, m: #words; * applied to clustering search results)
Our Proposed IR Framework
(Diagram: the Document Collection feeds an Affinity Graph, from which Affinity Rank (informativeness, with a diversity penalty) is computed query-independently; at query time, relevance to the Query is combined with Affinity Rank in a Re-rank step to produce the Output.)
Link Construction
- Similarity → directed link → directed graph
- Threshold: saves storage space and reduces the noise brought by the overwhelmingly large number of weak-similarity links
sim(A, B) = cos(A, B)
aff(A → B) = sim(A, B)
aff(B → A) = sim(A, B)
(a directed link between A and B is added only when the similarity exceeds the threshold)
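As a sketch of this link-construction step (cosine similarity between tf-idf vectors, with weak links pruned below a threshold), assuming dense NumPy document vectors; the helper name and the default threshold value are illustrative, not from the talk:

```python
import numpy as np

def build_affinity_graph(doc_vectors, threshold=0.2):
    """aff[i, j] = cos(d_i, d_j) when the cosine similarity reaches the
    threshold, else 0; weak links are pruned to save storage and reduce noise."""
    vecs = np.asarray(doc_vectors, dtype=float)
    norms = np.linalg.norm(vecs, axis=1, keepdims=True)
    unit = vecs / np.maximum(norms, 1e-12)   # guard against zero vectors
    sim = unit @ unit.T                      # pairwise cosine similarity
    np.fill_diagonal(sim, 0.0)               # no self-links
    return np.where(sim >= threshold, sim, 0.0)
```

Note that the resulting matrix is symmetric; directionality only appears after the per-row normalization in the link-analysis step.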
Assumptions
Observation: relations among documents vary; some are similar, others are not, and similarity varies in degree.
- The more relatives a document has, the more informative it is
- The more informative a document's relatives are, the more informative it is
Link Analysis
- Link map → adjacency matrix; row-normalize → M̃
- Based on the two assumptions, each document's score is the sum of the scores conveyed by its in-links:

  AR_j = Σ_{all i} AR_i · M̃_{i,j}

- Adding a random-jump factor c over n documents (e: all-ones vector):

  AR_j = c · Σ_{all i} AR_i · M̃_{i,j} + (1 − c)/n,  i.e.  AR = c · M̃ᵀ · AR + ((1 − c)/n) · e

- Principal eigenvector → rank score; implementation: Power Method
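The power-method computation above can be sketched as follows; the damping factor c, the tolerance, and the uniform fallback for rows with no out-links are assumptions in the spirit of PageRank-style iteration rather than details given in the slides:

```python
import numpy as np

def affinity_rank(aff, c=0.85, n_iter=100, tol=1e-10):
    """Power-method sketch of AR = c * M~^T AR + ((1 - c)/n) e, where M~ is
    the row-normalized affinity matrix and e is the all-ones vector."""
    aff = np.asarray(aff, dtype=float)
    n = aff.shape[0]
    row_sums = aff.sum(axis=1, keepdims=True)
    # row-normalize; rows with no out-links fall back to uniform weights
    M = np.divide(aff, row_sums, out=np.full_like(aff, 1.0 / n),
                  where=row_sums > 0)
    ar = np.full(n, 1.0 / n)
    for _ in range(n_iter):
        new = c * (M.T @ ar) + (1.0 - c) / n
        if np.abs(new - ar).sum() < tol:
            return new
        ar = new
    return ar
```

Because each row of M̃ sums to 1, the iteration preserves a total score of 1, so the result can be read as a stationary distribution.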
“Random Transform” Model
- A transforming document jumps from doc. to doc. at each time step: with probability c it follows an affinity link to a “relative” doc. (in proportion to affinity), and with probability 1 − c it jumps to a randomly picked doc.
- Markov chain → stationary transition probability → principal eigenvector → informativeness
Rank Adjustment
- Greedy-like algorithm: pick the most informative document i of each topic, then decrease the score of every other document j of that topic by the part conveyed from i:

  AR_j ← AR_j − M̃_{i,j} · AR_i

(Diagram: topic clusters T1-1 … T1-6 and T2-1 … T2-3.)
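A minimal sketch of this greedy rank-adjustment step; using the current (already penalized) score of the picked document, and breaking ties by index, are implementation assumptions:

```python
import numpy as np

def diversity_penalty_rank(ar, M, k=10):
    """Greedy rank adjustment: repeatedly pick the highest-scoring document i,
    then decrease every remaining score by the part conveyed from i, i.e.
    scores[j] -= M[i, j] * scores[i]. Same-topic documents, which received
    much of their score from i, are pushed down, so distinct topics surface."""
    scores = np.asarray(ar, dtype=float).copy()
    remaining = set(range(len(scores)))
    picked = []
    for _ in range(min(k, len(scores))):
        i = max(remaining, key=lambda d: scores[d])
        picked.append(i)
        remaining.discard(i)
        for j in remaining:
            scores[j] -= M[i, j] * scores[i]
    return picked
```

With two near-duplicate top documents and one document on a different topic, the distinct document overtakes the duplicate after the first pick.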
Re-rank
- Score-combine scheme (weight w):

  Score(q, d_i) = w · Sim(q, d_i) / Sim_Max(q) + (1 − w) · log AR(d_i) / log AR_Max

  where Sim_Max(q) = Max_{d_i} Sim(q, d_i) and AR_Max = Max_{d_i} AR(d_i)

- Rank-combine scheme:

  Score(q, d_i) = w · 1/Rank_Sim(q, d_i) + (1 − w) · 1/Rank_AR(d_i)
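The two combination schemes can be sketched as below. The exact formulas are reconstructed from the slides, so treat the details as assumptions; in particular, the score-combine sketch uses plain max-normalization of AR instead of the log scaling so that the mix stays monotone for scores below 1:

```python
def score_combine(sims, ars, w=0.5):
    """Mix max-normalized relevance Sim(q, d) with max-normalized Affinity
    Rank AR(d) using weight w: w = 1 keeps the pure relevance ranking,
    w = 0 ranks purely by Affinity Rank."""
    sim_max, ar_max = max(sims), max(ars)
    return [w * s / sim_max + (1 - w) * a / ar_max
            for s, a in zip(sims, ars)]

def rank_combine(sims, ars, w=0.5):
    """Reciprocal-rank variant: combine each document's position in the
    relevance ranking with its position in the Affinity Rank ranking."""
    def ranks(values):
        order = sorted(range(len(values)), key=lambda i: -values[i])
        pos = [0] * len(values)
        for r, i in enumerate(order, start=1):
            pos[i] = r
        return pos
    r_sim, r_ar = ranks(sims), ranks(ars)
    return [w / r_sim[i] + (1 - w) / r_ar[i] for i in range(len(sims))]
```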
Advantages of Affinity Rank
- Gives attention to both diversity and informativeness
- Implicitly expands the query towards multiple topics
- Automatically picks the representative documents for each chosen topic
- Most of the computation can be done OFFLINE
Experiment Setup
- Dataset: Microsoft Newsgroup; 117 Office-product-related newsgroups; 256,449 posts (mainly in 4 months), about 400 MB
- Preprocessing: title & text body (citations, signatures, etc. stripped); stemming, stop-word removal, tf-idf weighting
- Queries: 20 randomly picked query scenarios with query words
- Search results: Okapi, with the top 50 results as the answer set
Evaluation – Ground Truth
User study: 4 users independently evaluated all results. For each query, they first manually clustered all results into topics, then scored each result's informativeness within its topic, and finally scored each result's relevance to the query.
Evaluation: compare the original ranking with the new ranking (re-ranked by Affinity Rank) on 3 aspects of the top n results: diversity, informativeness & relevance.
Definitions
- Diversity: the number of different topics in a document group
- Informativeness: 3 = very informative, 2 = informative, 1 = somewhat informative, 0 = not informative
- Relevance: 1 = relevant, 0 = hard to tell, −1 = irrelevant
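Given the per-result topic labels and scores from the user study, the three top-n measures can be computed with a small helper; the function name and return layout are illustrative:

```python
def eval_top_n(topics, informativeness, relevance, n=10):
    """Top-n evaluation measures from the user-study labels:
    diversity = number of different topics among the top-n results;
    informativeness and relevance = mean of the per-result scores
    (0-3 and -1..1 scales respectively)."""
    top = slice(0, n)
    return {
        "diversity": len(set(topics[top])),
        "informativeness": sum(informativeness[top]) / n,
        "relevance": sum(relevance[top]) / n,
    }
```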
Experiment Result (1)
Top 10 search results, compared to traditional IR results:

                   Diversity   Informativeness   Relevance
Relative change    +31.02%     +11.97%           +0.72%
p value (t-test)   0.004632    0.002225          0.067255

Significant improvement in diversity & informativeness, without loss in relevance.
Experiment Result (2) - Diversity & Informativeness Improvement
Affinity Rank efficiently improves both the diversity & informativeness of top search results (re-ranking the top 50 results entirely by Affinity Rank, i.e. w = 0).
Experiment Result (3) - Parameter Tuning
Top 10 search results. Affinity Rank is robust:
1. The parameter does not matter much once enough weight is given to Affinity Rank
2. No over-tuning problem: simply re-ranking everything by Affinity Rank is nearly optimal
Experiment Result (4) - Parameter Tuning
Improvement overview subject to weight adjustment: Affinity Rank STABLY exerts a positive influence on diversity & informativeness enhancement.
Conclusion
- A new IR framework: Affinity Rank helps improve the diversity & informativeness of search results, especially the TOP ones
- Affinity Rank is computed offline, and therefore adds little burden to online retrieval
Future Work
- Metrics for information-quantity measurement
- Scaling to large collections
Thanks