Gravitation-Based Model for Information Retrieval
Shuming Shi, Ji-Rong Wen, Qing Yu, Ruihua Song, Wei-Ying Ma
Microsoft Research Asia
SIGIR 2005
INTRODUCTION
Most IR models fail to satisfy some basic intuitive heuristic constraints
IR models commonly lack intuitive physical interpretations
View information retrieval from the perspective of physics (theory of gravitation)
INTRODUCTION
Derived ranking formulas that satisfy all heuristic constraints
The BM25 term weighting function can be easily derived from our basic model
A more reasonable approach for structured document retrieval
Newton’s Theory of Gravitation
Two objects with masses m1 and m2 separated by distance d attract each other with force F = G·m1·m2 / d^2, where G is the universal gravitational constant
Okapi’s BM25 formula
The original representation of w(t) can produce a negative IDF, so we use w(t) = ln((N+1)/df(t)) instead
Amati and van Rijsbergen propose nonparametric term weighting functions. They claim that the BM25 function with certain parameter settings (k1=1.2, b=0.75; or k1=2, b=0.75) can be approximated numerically by one of their generated functions
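The slide's formula image is not preserved in this transcript, so the following is only a minimal sketch of the widely used BM25 form, combined with the non-negative weight w(t) = ln((N+1)/df(t)) mentioned above. The function names, the dictionary-based inputs, and the omission of the query-term-frequency factor are assumptions of this sketch, not details taken from the slides.

```python
import math
from collections import Counter

def idf(N, df):
    """Non-negative term weight w(t) = ln((N + 1) / df(t))."""
    return math.log((N + 1) / df)

def robertson_idf(N, df):
    """Classic Robertson/Sparck Jones weight; negative when df > N / 2."""
    return math.log((N - df + 0.5) / (df + 0.5))

def bm25(query, doc, doc_freq, N, avg_len, k1=1.2, b=0.75):
    """Score one tokenized document against a tokenized query.

    doc_freq maps a term to its document frequency, N is the collection
    size, and avg_len is the average document length (all hypothetical
    inputs for this sketch). The query-term-frequency factor is omitted.
    """
    tf = Counter(doc)
    # Document-length normalization shared by every term in this document.
    norm = k1 * ((1 - b) + b * len(doc) / avg_len)
    score = 0.0
    for t in set(query):
        f = tf[t]
        if f == 0 or t not in doc_freq:
            continue  # a term contributes only if it occurs in the document
        score += idf(N, doc_freq[t]) * (k1 + 1) * f / (norm + f)
    return score
```

For a term that occurs in 8 of 10 documents, `robertson_idf(10, 8)` is negative while `idf(10, 8)` stays positive, which is the motivation for the substitution.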
Structured Document Retrieval
The most commonly used approach for structured document retrieval may be a linear combination of field scores or ranks
Robertson combines term frequencies instead of field scores
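The difference between the two approaches can be sketched as follows. This is an illustration under simplifying assumptions (a single query term, a BM25-style saturation function without length normalization, hypothetical function names), not the exact formulas used in the talk.

```python
def saturate(tf, k1=1.2):
    """BM25-style term-frequency saturation (length normalization omitted)."""
    return (k1 + 1) * tf / (k1 + tf)

def score_combination(field_tfs, field_weights, idf=1.0):
    """Score each field separately, then take the weighted sum of the scores."""
    return sum(w * idf * saturate(tf) for tf, w in zip(field_tfs, field_weights))

def tf_combination(field_tfs, field_weights, idf=1.0):
    """Take the weighted sum of term frequencies, then score once."""
    combined_tf = sum(w * tf for tf, w in zip(field_tfs, field_weights))
    return idf * saturate(combined_tf)
```

With four equally weighted fields that each contain the term once, `score_combination([1, 1, 1, 1], [1, 1, 1, 1])` gives 4.0, above the single-field ceiling of k1 + 1 = 2.2, whereas `tf_combination` on the same input gives about 1.69 and stays under that ceiling. Combining frequencies before saturation keeps the nonlinearity of the term weighting intact.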
GRAVITATION-BASED MODEL
A term is viewed as a physical object composed of some number of particles
A particle has two attributes
Type: determined by the term it composes
Mass
Term Object
A term is composed of particles with a specified structure (or shape)
Sphere
Ideal cylinder: a cylinder whose radius is small enough that it can be viewed as a line
There are three attributes related to a term: type, mass, and diameter
Term Object
[Figure: sphere-shape terms and ideal-cylinder-shape terms]
Document Object
There are two categories of term objects in each document
Explicit (or seen) terms: all the terms that occur in the content of the document
Implicit (or hidden) terms: terms that do not occur in the document, used to represent the hidden meaning of the document
We use |D| and |H(D)| to represent the number of explicit and implicit terms, respectively
Document Object
The mass of a document is the total mass of all its seen and hidden terms
The diameter of a document is defined as the sum of the diameters of all its terms
A query is modeled as an object composed of its terms (for simplicity, no hidden terms are assumed)
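The object definitions above can be sketched as a small data structure. The class and attribute names are hypothetical, and the way individual term masses and diameters are estimated is left out; the sketch only shows that a document's mass and diameter are sums over its seen and hidden terms.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class TermObject:
    term_type: str   # the term this object represents
    mass: float
    diameter: float

@dataclass
class DocumentObject:
    explicit_terms: List[TermObject] = field(default_factory=list)  # seen terms
    hidden_terms: List[TermObject] = field(default_factory=list)    # implicit terms

    @property
    def mass(self) -> float:
        """Total mass of all seen and hidden terms."""
        return sum(t.mass for t in self.explicit_terms + self.hidden_terms)

    @property
    def diameter(self) -> float:
        """Sum of the diameters of all terms."""
        return sum(t.diameter for t in self.explicit_terms + self.hidden_terms)

# A query can reuse the same class with hidden_terms left empty.
query = DocumentObject(explicit_terms=[TermObject("gravitation", 1.0, 1.0)])
```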
Basic Model: Discrete Version
Basic Model: Continuous Version
Mass and Diameter Estimation
Assumption 1
For any two terms, their mass ratio in any document is equal to the ratio of their average masses in the whole collection
Assumption 2
The average mass of a term depends only on its inverse document frequency
Mass and Diameter Estimation
Assumption 3
The average global importance of all terms in a document is constant
We now estimate the mass of a term in a document
Mass and Diameter Estimation
Assumption 4
The number of hidden terms in a document depends only on the statistical information of the whole collection
Model Analysis
Continuous model
Model Analysis
Discrete model
Relationship with the BM25 formula
If m(D) and di(D) from the previous section are constant, it is easy to prove that the continuous model is equivalent to a simplified BM25
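One way to see why an inverse-square field can produce BM25-like saturation is sketched below. The geometry is an assumption made for illustration and is not taken from the slides: a query particle of unit mass sits at the origin, and a document term is an ideal cylinder of uniform linear density \rho lying on the segment [a, a+L] of a line through the origin.

```latex
\begin{align*}
F &= \int_{a}^{a+L} \frac{G\,\rho}{x^{2}}\,dx
   = G\,\rho\left(\frac{1}{a} - \frac{1}{a+L}\right)
   = \frac{G\,\rho}{a}\cdot\frac{L}{a+L}
\end{align*}
```

If the length L is taken to be proportional to the term frequency, L = c·tf, and a is held constant, then L/(a+L) = tf/(a/c + tf). This has the same saturating shape as the BM25 term-frequency component, which has the form tf/(k + tf) up to a constant factor.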
Field Functions
Structured Document Retrieval
Because the masses of terms in field 1 are larger than those in field 2, field 1’s terms are nearer to the query
Analysis and Comparison
Score combination
A document that matches a single query term over many fields could get an unreasonably higher score than one that matches all the query terms in a few fields
TF combination
Field weights F1=5, F2=1
The score of d3, which contains t in its high-weight field and enough terms in its low-weight field, should be larger than that of d4
EXPERIMENTS
Experiments on two corpora and seven query sets from TREC 2000 to 2004
Term Weighting Experiments
VSM-Piv: pivoted normalization VSM formula
LM-Dir: language model formula with Dirichlet prior smoothing
GBM-Inv: the GBM formula derived from the gravitational field function 1/x
GBM-Exp: the GBM formula derived from the field function e^(-x)
GBM-Std: equivalent to the BM25 formula
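To give a feel for how the choice of field function changes the shape of the term weighting, the sketch below integrates each function over a line segment [a, a + length]. The segment geometry, the function names, and the inclusion of the inverse-square function of Newtonian gravity for comparison are assumptions of this sketch; the exact GBM formulas are on the slides and are not reproduced in this transcript.

```python
import math

def force_inverse_square(a, length):
    """Closed-form integral of 1/x^2 over [a, a + length]."""
    return 1.0 / a - 1.0 / (a + length)

def force_inverse(a, length):
    """Closed-form integral of 1/x over [a, a + length]."""
    return math.log((a + length) / a)

def force_exponential(a, length):
    """Closed-form integral of e^(-x) over [a, a + length]."""
    return math.exp(-a) * (1.0 - math.exp(-length))

def integrate(f, a, length, steps=100_000):
    """Midpoint-rule integral of f over [a, a + length], used as a check."""
    h = length / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))
```

As the length grows, the inverse-square and exponential integrals approach finite limits (1/a and e^(-a) respectively), while the 1/x integral keeps growing logarithmically, so the three functions imply different saturation behavior for frequent terms.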
Field Structure Experiments
The scores for the baseline, ScoreComb, and FreqComb are all generated by BM25 (b=0.75)
CONCLUSION AND FUTURE WORK
In this paper, a gravitation-based IR model (GBM) was proposed
GBM not only gives a physical interpretation of BM25, but also yields new effective ranking formulas
The derived term weighting functions satisfy all the heuristic constraints
CONCLUSION AND FUTURE WORK
A novel approach for structured document retrieval
Use static document ranks, e.g. PageRank, to derive m(D)
Study the relationship and possible combination between our model and existing models