Diversity in Ranking via Resistive Graph Centers
Avinava Dubey, IBM Research India
Soumen Chakrabarti, IIT Bombay
Chiranjib Bhattacharyya, IISc Bangalore
2
PageRank: Conventional view
Inputs
• Graph with edge conductance matrix C
• Personalized teleport distribution r
• Walk with probability α, teleport w.p. 1 − α
• “Biased random surfer”
Output
• Steady state visit distribution
• “You should emulate the aggregate behavior of many random surfers”
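As a rough illustration of the conventional view, the steady state visit distribution can be approximated by power iteration. This is a minimal sketch, assuming C has been normalized so that each column sums to 1 (C[j][i] is the probability of stepping from i to j) and that the walk probability is called alpha; the function name and the toy graph are made up for the example.

```python
def personalized_pagerank(C, r, alpha=0.85, iters=200):
    """Power iteration for the fixed point p = alpha * C p + (1 - alpha) * r.

    Assumption: C[j][i] is the probability of stepping from node i to
    node j (columns sum to 1); r is the teleport distribution.
    """
    n = len(r)
    p = list(r)
    for _ in range(iters):
        p = [alpha * sum(C[j][i] * p[i] for i in range(n)) + (1 - alpha) * r[j]
             for j in range(n)]
    return p


# Toy chain 0 <-> 1 <-> 2 with all teleport mass on node 0 (hypothetical data).
C = [[0.0, 0.5, 0.0],
     [1.0, 0.0, 1.0],
     [0.0, 0.5, 0.0]]
p = personalized_pagerank(C, r=[1.0, 0.0, 0.0])
```

Because all teleport mass sits on node 0, node 0 ends up with more visit probability than node 2 even though the graph itself is symmetric.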
3
User view: Exact opposite!
Random search-guided surfer
Search engine knows relevant subgraph
But user can inspect only a few hits
Search engine outputs sparse teleport r
4
User view: Exact opposite!
User diffuses out through sparse teleport
Occasionally teleports back to search results
Eventually explores green subgraph
(Red, green “boundaries” are probabilistic)
5
Diffusion defined via subsumption
Original PageRank: diffusion via hyperlinks
But frequently used with other kinds of edges
Suppose surfer is on page i
And, having read i, there is no new info in j
Then let C(j|i), also written as C(i→j), be large
6
Graph center diversity (GCD)
Suppose the searcher can click through at most three links returned by the search engine
If any of the pages could be potentially relevant, …
… then we cannot waste teleports on one cluster
A natural definition of diversity
7
Formulation summary thus far
Search engine knows what’s best for query
• Node i has relevance b(i)
User has limited patience scanning results
• r must be sparse: at most K positive elements
Conductance matrix C and walk probability predict user behavior once given r
Steady state visit probabilities given by the personalized PageRank for C and r
Inference, hard: design sparse r to minimize the divergence between b and the steady state visit probabilities
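The slides call this inference problem hard and do not spell out a solver at this point. Purely to make the objective concrete, the sketch below greedily grows a teleport support of size K, scoring each candidate support by the KL divergence between b and the resulting PageRank. Uniform teleport weights, the KL direction, the greedy strategy, and all names are assumptions for illustration, not the paper's algorithm.

```python
import math


def pagerank(C, r, alpha=0.85, iters=200):
    # Assumption: C[j][i] is the step probability from i to j (columns sum to 1).
    n = len(r)
    p = list(r)
    for _ in range(iters):
        p = [alpha * sum(C[j][i] * p[i] for i in range(n)) + (1 - alpha) * r[j]
             for j in range(n)]
    return p


def kl(b, p, eps=1e-12):
    # KL(b || p); p is clamped at eps so unreachable nodes give a large penalty.
    return sum(bi * math.log(bi / max(pi, eps)) for bi, pi in zip(b, p) if bi > 0)


def greedy_teleport(C, b, K, alpha=0.85):
    """Pick K teleport nodes one at a time (hypothetical heuristic)."""
    n = len(b)
    chosen = []
    for _ in range(K):
        best = None
        for i in range(n):
            if i in chosen:
                continue
            support = chosen + [i]
            r = [1.0 / len(support) if v in support else 0.0 for v in range(n)]
            score = kl(b, pagerank(C, r, alpha))
            if best is None or score < best[0]:
                best = (score, i)
        chosen.append(best[1])
    return chosen


# Two disconnected clusters {0, 1} and {2, 3}, all nodes equally relevant.
C = [[0.0, 1.0, 0.0, 0.0],
     [1.0, 0.0, 0.0, 0.0],
     [0.0, 0.0, 0.0, 1.0],
     [0.0, 0.0, 1.0, 0.0]]
b = [0.25, 0.25, 0.25, 0.25]
chosen = greedy_teleport(C, b, K=2)
```

On this toy graph the two chosen nodes land in different clusters, because a second teleport inside an already covered cluster leaves the other cluster with no visit probability at all.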
8
Attention decay profile
Design a teleport r with decaying weights
So as to align weighted merged clouds with b
Attention profile
9
Learning subsumption C(i→j)
How strongly does i render j redundant?
Associate edge i→j with features { f }
Each f has associated fixed conductance matrix C_f and personalized PageRanks M_f
Training: Given diverse node sets (r*), learn the weights of the convex combination over features f
Simple heuristic (convex optimization): minimize KL divergence between b and PageRank
10
Structured learning style formulation
More accurately, any r ≠ r* should do worse
Define a loss comparing the divergence for r with the divergence for r*
Combine over query instances
Paper gives an online update algorithm to improve iteratively (exponentiated gradient)
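Exponentiated gradient keeps a weight vector on the probability simplex by multiplying each weight by the exponential of its negative gradient and then renormalizing, which suits a convex combination over features. The sketch below shows one generic step; the gradient values, the step size eta, and the function name are placeholders, and the paper's actual loss gradient is not reproduced here.

```python
import math


def eg_step(weights, grads, eta=0.5):
    """One exponentiated-gradient step; the result is again a convex combination."""
    scaled = [w * math.exp(-eta * g) for w, g in zip(weights, grads)]
    total = sum(scaled)
    return [s / total for s in scaled]


# Three features starting from uniform weights; gradients are made-up numbers.
w = eg_step([1 / 3, 1 / 3, 1 / 3], grads=[0.2, -0.1, 0.0])
```

Features whose gradient is negative (increasing their weight lowers the loss) gain weight, and the weights stay positive and sum to 1 without an explicit projection step.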
11
Marginal utility methods
Max marginal relevance (MMR)
• Given q, already chose subset S; next choice is the d ∉ S maximizing λ sim1(d, q) − (1 − λ) max over s ∈ S of sim2(d, s)
SubTopic
• Similar to MMR
• sim1 and sim2 use probabilistic topic models
SVMdiv
• Learns subtopic coverage from word coverage
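For reference, greedy MMR selection can be sketched as follows, assuming precomputed similarities (query_sim standing in for sim1, doc_sim for sim2) and a trade-off parameter lam. The function name and the toy numbers are invented for the example.

```python
def mmr(query_sim, doc_sim, k, lam=0.7):
    """Greedy max marginal relevance.

    query_sim[d] = similarity of document d to the query,
    doc_sim[d][s] = similarity between documents d and s.
    """
    selected = []
    candidates = list(range(len(query_sim)))
    while candidates and len(selected) < k:
        def score(d):
            # Redundancy is the similarity to the closest already-chosen doc.
            redundancy = max((doc_sim[d][s] for s in selected), default=0.0)
            return lam * query_sim[d] - (1 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Documents 0 and 1 are near-duplicates; document 2 is less relevant but novel.
query_sim = [0.9, 0.85, 0.5]
doc_sim = [[1.0, 0.95, 0.1],
           [0.95, 1.0, 0.1],
           [0.1, 0.1, 1.0]]
picked = mmr(query_sim, doc_sim, k=2, lam=0.5)  # picked == [0, 2]
```

With lam=0.5 the near-duplicate document 1 is skipped in favor of the novel document 2, even though document 1 is more similar to the query.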
12
PageRank-based diversity models
Grasshopper
• Edges associated with fixed similarity scores
• Best node has highest PageRank
• Make best node a sink, run PageRank again
• Note, no meaningful steady state, Pr(sink)=1
• Next best node has largest expected number of visits before walk absorbed in sink
DivRank
• With visits to node j, inbound edges get thicker
• Rich gets even richer than you expected
• Tiebreaking causes one cluster member to win
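The “expected number of visits before the walk is absorbed” that Grasshopper ranks by can be illustrated on a tiny chain. This is a minimal sketch, assuming a row-stochastic transition matrix P and a given start distribution; it accumulates the walk's surviving probability mass step by step instead of inverting a matrix, and it is not the Grasshopper implementation. All names and numbers are hypothetical.

```python
def expected_visits(P, absorbed, start, tol=1e-12, max_steps=100000):
    """Expected visits to each node before the walk reaches an absorbed node.

    P[i][j] is the probability of stepping from i to j; `start` is the
    initial distribution; nodes in `absorbed` act as sinks.
    """
    n = len(P)
    mass = [0.0 if i in absorbed else start[i] for i in range(n)]
    visits = [0.0] * n
    for _ in range(max_steps):
        if sum(mass) < tol:
            break
        visits = [v + m for v, m in zip(visits, mass)]
        nxt = [sum(mass[i] * P[i][j] for i in range(n)) for j in range(n)]
        # Mass that lands on a sink is removed from the walk.
        mass = [0.0 if j in absorbed else nxt[j] for j in range(n)]
    return visits


# Node 0 is the sink; node 1 steps to 0 or 2 with equal probability; 2 returns to 1.
P = [[1.0, 0.0, 0.0],
     [0.5, 0.0, 0.5],
     [0.0, 1.0, 0.0]]
visits = expected_visits(P, absorbed={0}, start=[0.0, 1.0, 0.0])
```

For this chain the walk visits node 1 twice and node 2 once in expectation, which matches the closed form (I − Q)⁻¹ for the non-absorbed block Q.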
13
Submodular set selection
Sounds similar to MMR but on a graph
• Undirected edge (i,j) has weight w_ij
Given node set V, select subset S so as to
• Maximize coverage of V \ S
• Minimize redundancy within S
Additional size budget constraint
Hard, but provable approximations
No learning of edge weight/conductance
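To make the coverage versus redundancy trade-off concrete, the sketch below scores a subset S by the total edge weight from S to V \ S minus lam times the total edge weight inside S, and grows S greedily under a cardinality budget K. The exact objective, the parameter lam, the cardinality form of the budget, and the plain greedy loop are assumptions for illustration; no approximation guarantee is claimed for this simplified version.

```python
def objective(W, S, lam=1.0):
    """Coverage of V \\ S by S minus lam times redundancy within S."""
    n = len(W)
    rest = [i for i in range(n) if i not in S]
    coverage = sum(W[i][j] for i in rest for j in S)
    redundancy = sum(W[i][j] for i in S for j in S if i < j)
    return coverage - lam * redundancy


def greedy_select(W, K, lam=1.0):
    """Add the node with the best objective until K nodes or no improvement."""
    S = []
    for _ in range(K):
        rest = [i for i in range(len(W)) if i not in S]
        best = max(rest, key=lambda i: objective(W, S + [i], lam))
        if objective(W, S + [best], lam) <= objective(W, S, lam):
            break
        S.append(best)
    return S


# Two tight pairs {0, 1} and {2, 3} with weak cross links (made-up weights).
W = [[0.0, 0.9, 0.1, 0.1],
     [0.9, 0.0, 0.1, 0.1],
     [0.1, 0.1, 0.0, 0.8],
     [0.1, 0.1, 0.8, 0.0]]
S = greedy_select(W, K=2)
```

On this toy graph the selection takes one node from each pair: picking both members of a pair would pay a large redundancy penalty while covering little of the remaining graph.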
14
Experiments: Three diverse domains
Subtopic information retrieval (TREC)
• Query under-specified or ambiguous
• Balance responses across subtopics or facets
Social network search (IMDB)
• List high-prestige actors without knowing country
• Diversity = many countries covered
Extractive document summarization (DUC)
• Choose subset of sentences
• That are representative of the whole document
• And do not render each other redundant
15
Subtopic information retrieval results
Ground truth has subtopics covered by each doc
Subtopic-aware precision vs. recall
GCD dominates other subtopic IR approaches
16
Effect of training
Uniform = all f equal
Maxent = convex heuristic minimizing KL divergence between b and PageRank
EG = Exponentiated gradient
Successive improvements in subtopic-aware mean average precision
17
Ranking in social networks (IMDB)
3452 actors, 1027 movies, 47 countries
Actor’s prestige depends on prestige of movies where s/he has worked
Rank actors by prestige
GCD rapidly increases distinct countries
While also increasing number of movies
18
Document summarization
DUC 2004, task 2, ROUGE-1
30 summaries to train, 20 to test
MMR, SubTopic not competitive
Associative graph diffusion (Grasshopper, DivRank) worse than GCD and Submodular
GCD comparable to Submodular even without using sentence size budget constraints

Algorithm    Train  Test
MMR          0.324  0.32
SubTopic     0.32   0.323
Grasshopper  0.341  0.33
DivRank      0.353  0.345
GCD          0.377  0.374
Submodular   0.389  0.373
Optimal      0.421  0.407
19
Conclusion
A novel model for redundancy and diversity
Based on an “inverted” notion of PageRank
Inference amounts to finding centers in conductance graphs
• “GCD”, graph center diversity
Bonus: learn conductance via edge features
GCD shows better or similar performance in three diverse application domains
20
Bibliography
1. J. Carbonell and J. Goldstein. The use of MMR, diversity-based reranking for reordering documents and producing summaries. In SIGIR Conference, 1998.
2. C. X. Zhai, W. W. Cohen, and J. Lafferty. Beyond independent relevance: methods and evaluation metrics for subtopic retrieval. In SIGIR Conference, 2003.
3. X. Zhu, A. B. Goldberg, J. Van Gael, and D. Andrzejewski. Improving diversity in ranking using absorbing random walks. In HLT-NAACL, 2007.
4. Y. Yue and T. Joachims. Predicting diverse subsets using structural SVMs. In ICML, 2008.
5. Q. Mei, J. Guo, and D. Radev. DivRank: the interplay of prestige and diversity in information networks. In SIGKDD Conference, 2010.
6. H. Lin and J. Bilmes. Multi-document summarization via budgeted maximization of submodular functions. In NAACL-HLT, 2010.