Improved search for Socially Annotated Data
Authors: Nikos Sarkas, Gautam Das, Nick Koudas
Presented by: Amanda Cohen Mostafavi
Introduction
• Social Annotation: A process where users collaboratively assign a short sequence of keywords (tags) to a number of resources
▫Each tag sequence is a concise and accurate summary of the resource’s content
▫Meant to aid navigation through a collection
• Leads to searching via tags
▫Enables relevant text retrieval
▫Allows accurate retrieval of non-textual objects
▫Presents a need for an efficient retrieval and ranking method based on user tags
RadING
•Ranking annotated data using Interpolated N-Grams
•Searching and ranking method based exclusively on user tags
•Uses interpolated n-grams to model tag sequences associated with every resource
•How does it rank?
Probabilistic Foundations
•Goal: To rank resources by the probability that they will be relevant to the query
•Given a keyword query Q and a resource R from the collection, we apply Bayes' theorem to get:
$$p(R \text{ is relevant} \mid Q) = \frac{p(Q \mid R \text{ is relevant})\, p(R \text{ is relevant})}{p(Q)}$$
where p(R is relevant) is the probability that R is relevant, independent of the query posed, and p(Q) is the probability of the query being issued
Probabilistic Foundations
•p(R is relevant) is assumed constant across the resource collection, and p(Q) does not depend on the resource
▫Meaning: ranking resources by p(R is relevant|Q) is equivalent to ranking by p(Q|R is relevant)
•In order to estimate the probability of the query being “generated” by each resource, resources need to be modeled based on knowledge of the social annotation process
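The ranking equivalence can be written out in one step. This is only a sketch of the reasoning on the slide, assuming (as stated there) that p(R is relevant) takes the same value for every resource:

```latex
\begin{align*}
p(R \text{ is relevant} \mid Q)
  &= \frac{p(Q \mid R \text{ is relevant})\, p(R \text{ is relevant})}{p(Q)} \\
  &\propto p(Q \mid R \text{ is relevant})
  \quad \text{(for a fixed query } Q \text{, both } p(Q) \text{ and } p(R \text{ is relevant}) \text{ are the same for every } R\text{)}
\end{align*}
```

Multiplying every score by the same positive constant does not change the order of the resources, so both rankings coincide.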
Dynamics and Properties of the Social Annotation Process
•The goal of the tagging process is to describe the resource’s content
•User opinions crystallize quickly: annotation trends can be found after witnessing a small number of assignments
•Therefore we assume the following:
▫p(Q | R is relevant) = p(Q is used to tag R)
▫In English: users draw keyword sequences from the same distribution both to tag a resource and to search for it
Social Annotation Process: Things to consider…
•Resources are rarely given assignments with only one tag
•Also, tag positions are not random: assignments progress from left to right, from more general to more specific tags
•Tags representing different perspectives on a resource are less likely to occur together in the same assignment
•N-gram models are used to model these co-occurrence patterns
N-gram Models
•Given an assignment made up of a sequence s of l tags t_1,…,t_l, the probability of this sequence being assigned to a resource is:
▫$p(t_1,\dots,t_l) = p(t_1)\,p(t_2 \mid t_1)\cdots p(t_l \mid t_1,\dots,t_{l-1})$
•The purpose of using n-gram models is to approximate the conditional probability of each tag using only the preceding n-1 tags
▫In the case of a bi-gram model, $p(t_k \mid t_1,\dots,t_{k-1})$ is approximated by $p(t_k \mid t_{k-1})$
N-gram Models
•Calculate the probability using the Maximum Likelihood equation:
$$p(t_2 \mid t_1) = \frac{c(t_1, t_2)}{\sum_{t} c(t_1, t)}$$
•c(t1, t2) = the number of occurrences of the bi-gram
•The summation is the sum of the occurrences of all bigrams involving t1 as the first tag
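A minimal sketch of the maximum likelihood estimate, assuming each assignment is given as an ordered list of tag strings. The function name `train_bigram_mle` and the toy assignments are made up for illustration and do not come from the paper:

```python
from collections import Counter


def train_bigram_mle(assignments):
    """Return p(t2, t1) = c(t1, t2) / sum_t c(t1, t) for one resource."""
    bigram_counts = Counter()     # c(t1, t2)
    first_tag_totals = Counter()  # sum over t of c(t1, t)
    for tags in assignments:
        # Consecutive tag pairs of one assignment are its bi-grams.
        for t1, t2 in zip(tags, tags[1:]):
            bigram_counts[(t1, t2)] += 1
            first_tag_totals[t1] += 1

    def p(t2, t1):
        total = first_tag_totals[t1]
        # A tag never seen as a first element yields probability 0.
        return bigram_counts[(t1, t2)] / total if total else 0.0

    return p


# Hypothetical assignments attached to a single resource.
assignments = [
    ["python", "tutorial", "beginner"],
    ["python", "tutorial"],
    ["python", "reference"],
]
p = train_bigram_mle(assignments)
print(p("tutorial", "python"))  # 2 of the 3 bi-grams starting with "python"
```

Any bi-gram that was never observed gets probability zero under this estimate, which is the sparsity problem that interpolation addresses.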
Interpolation
•Interpolation is used to compensate for sparse data: it redistributes probability mass from high counts to low counts
•Used the Jelinek-Mercer interpolation technique. Applied to a bi-gram, it yields:
$$p(t_2 \mid t_1) = \lambda_2\,\hat{p}(t_2 \mid t_1) + \lambda_1\,\hat{p}(t_2) + \lambda_0\,p_{bg}(t_2)$$
$$0 \le \lambda_0, \lambda_1, \lambda_2 \le 1, \qquad \lambda_0 + \lambda_1 + \lambda_2 = 1$$
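The interpolation is a weighted sum of three estimates whose weights sum to 1. The sketch below assumes the bi-gram and unigram terms are maximum likelihood estimates and that the third term is a background (collection-wide) tag probability; the slide does not define these terms, so that reading is an interpretation. The function name and the numbers are illustrative only:

```python
def interpolate(p_bigram, p_unigram, p_background, lam1, lam2):
    """Jelinek-Mercer style mix: lam2*bigram + lam1*unigram + lam0*background.

    lam0 is implied by the constraint lam0 + lam1 + lam2 = 1.
    """
    lam0 = 1.0 - lam1 - lam2
    # Small tolerance so floating-point rounding at the boundary is accepted.
    if min(lam0, lam1, lam2) < -1e-12:
        raise ValueError("weights must lie in [0, 1] and sum to 1")
    return lam2 * p_bigram + lam1 * p_unigram + lam0 * p_background


# An unseen bi-gram (ML estimate 0) still receives some probability mass
# from the unigram and background terms.
smoothed = interpolate(p_bigram=0.0, p_unigram=0.05, p_background=0.001,
                       lam1=0.3, lam2=0.6)
print(smoothed)
```

With only two free weights, the likelihood becomes a function of λ1 and λ2 alone.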
Parameter Optimization
•Goal: to maximize the likelihood function L(λ1,λ2) in order to find the ideal interpolation parameters
•Definitions:
▫D*: The constrained domain of λ1 and λ2
▫λ*: The global maximum of L(λ1,λ2)
▫λc: The point at which L(λ1,λ2) attains its maximum value within D*; this is the point that must be found to optimize the parameters
RadING Optimization Framework
•Step 1: If L(λ1,λ2) is unbounded, perform 1D optimization to locate λc
•Step 2: If L(λ1,λ2) is bounded, apply 2D optimization to find λ*
•Step 3: If λ* is not in D*, locate λc
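To make the target of the optimization concrete, the sketch below finds an approximate λc by brute-force grid search over the constrained domain. This is not the RadING framework described above, only a slow reference point one could compare it against. It assumes the likelihood is evaluated on held-out tag pairs, each represented by a hypothetical triple (bi-gram estimate, unigram estimate, background estimate):

```python
import math


def log_likelihood(events, lam1, lam2):
    """log L(lam1, lam2) over held-out (p_bigram, p_unigram, p_bg) triples."""
    lam0 = max(0.0, 1.0 - lam1 - lam2)  # clamp rounding error at the boundary
    total = 0.0
    for p_bi, p_uni, p_bg in events:
        p = lam2 * p_bi + lam1 * p_uni + lam0 * p_bg
        if p <= 0.0:
            return float("-inf")  # this weight choice cannot explain the event
        total += math.log(p)
    return total


def grid_search(events, steps=20):
    """Scan grid points with lam1, lam2 >= 0 and lam1 + lam2 <= 1."""
    best = (float("-inf"), 0.0, 0.0)
    for i in range(steps + 1):
        for j in range(steps + 1 - i):
            lam1, lam2 = i / steps, j / steps
            ll = log_likelihood(events, lam1, lam2)
            if ll > best[0]:
                best = (ll, lam1, lam2)
    return best


# Made-up held-out events for illustration.
events = [(0.5, 0.2, 0.01), (0.0, 0.1, 0.02), (0.8, 0.3, 0.01)]
best_ll, best_lam1, best_lam2 = grid_search(events)
print(best_lam1, best_lam2, best_ll)
```

The grid costs O(steps²) likelihood evaluations per resource, which suggests why a dedicated 1D/2D optimization procedure matters when a model is trained for every resource.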
Searching Process
•Step 1: Train a bi-gram model for each resource
▫Compute the bi-gram and unigram probability and optimize the interpolation parameters
•Step 2: At query time, compute the probability of the query keyword sequence being generated by each resource’s bi-gram model:
$$p(q_1,\dots,q_k \mid R) = \prod_{j=1}^{k} p(q_j \mid q_{j-1})$$
•Use Threshold Algorithm to compute top-k results
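A minimal sketch of the query-time step, assuming each resource's model is available as a function giving the (already interpolated) probability of a keyword given the previous one. It scores every resource and keeps the best k with a heap, which is a simple stand-in and not the Threshold Algorithm named above. The `START` marker, `table_model`, and the toy probabilities are hypothetical:

```python
import heapq

START = "<s>"  # assumed start-of-sequence marker playing the role of q_0


def query_probability(query, cond_prob):
    """Product over j of p(q_j | q_{j-1}) under one resource's model."""
    prob = 1.0
    prev = START
    for q in query:
        prob *= cond_prob(q, prev)
        prev = q
    return prob


def top_k(query, models, k):
    """Score every resource exhaustively and return the k best (score, id)."""
    scored = ((query_probability(query, model), rid)
              for rid, model in models.items())
    return heapq.nlargest(k, scored)


def table_model(table, floor=1e-6):
    # floor stands in for the smoothed probability of unseen pairs.
    return lambda q, prev: table.get((prev, q), floor)


models = {
    "url_a": table_model({(START, "python"): 0.6, ("python", "tutorial"): 0.5}),
    "url_b": table_model({(START, "python"): 0.2, ("python", "tutorial"): 0.1}),
    "url_c": table_model({(START, "cooking"): 0.9}),
}
results = top_k(["python", "tutorial"], models, k=2)
print([rid for _, rid in results])
```

For long queries or large collections, summing log-probabilities instead of multiplying raw probabilities avoids floating-point underflow.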
Searching Example
Experimental Evaluation
•Test data: web crawl of del.icio.us
▫70,658,851 assignments
▫Posted by 567,539 users
▫Attached to 24,245,248 unique URLs
▫Average length of assignment: 2.77
▫Standard deviation: 2.70
▫Median: 2
Optimization Efficiency
Ranking Effectiveness
•Compares the RadING ranking method to adaptations of tf/idf ranking
▫Tf/Idf: concatenates each resource’s assignments into a document and performs ranking based on the tf/idf similarity of the query to each document
▫Tf/Idf+: computes the tf/idf similarity of the query to each individual assignment and ranks resources based on average similarity
•10 judges contacted through Amazon Mechanical Turk to measure precision