Matching Reviews
to Objects using a Language Model
Nilesh Dalvi, Ravi Kumar, Bo Pang, Andrew Tomkins
Yahoo! Research
ACL and AFNLP, 2009
15 April, 2014
Jaehwan Lee
Outline
Introduction
Related Work
Model and Method
Data
Evaluation
Conclusions
Introduction
The search engine would like
– to offer a high-quality result set even for obscure restaurants
– to enable advanced applications and recommendations
To do so, it faces two high-level challenges
– identify the restaurant review pages on the Web
– identify the restaurant that is being reviewed
Note
– restaurant reviews are the running example
– “the techniques are general”
Introduction
Two Related Problem Settings
Entity Matching
– to find the correspondence between two structured objects
Information Retrieval (IR)
– to match short unstructured text (a query) against unstructured text (documents)
Introduction
Classical IR Methods Don’t Fit
Example: “food”
– “food” is rare as a restaurant name
– thus, it gets a very high IDF score
– and hence a restaurant with “food” in its name will likely be the top match for every review containing the word “food”
Unlike in traditional IR
– the query (i.e., the review) is long and the document (i.e., the restaurant) is short
[Diagram: traditional IR matches a query against a document; in review matching, the review plays the role of the query and the restaurant the role of the document]
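A minimal Python sketch of the “food” problem. The restaurant names, the review text, and the plain idf(w) = log(N / df(w)) variant are all illustrative assumptions, not taken from the slides. The restaurant names act as the short documents and the review acts as the long query.

```python
import math
from collections import Counter

# Hypothetical restaurant names acting as the (short) documents.
restaurants = [
    "Food",
    "Luigi Pizza House",
    "Golden Dragon House",
    "Pizza Palace",
]

def tokenize(text):
    return text.lower().split()

# Document frequency: in how many restaurant names each word appears.
df = Counter(w for name in restaurants for w in set(tokenize(name)))
n_docs = len(restaurants)

def idf(word):
    # Plain IDF, log(N / df); only called for words that occur in some name.
    return math.log(n_docs / df[word])

def tfidf_score(review, name):
    # Review term count times IDF, summed over words shared with the name.
    counts = Counter(tokenize(review))
    return sum(counts[w] * idf(w) for w in set(tokenize(name)) if w in counts)

# A review about "Golden Dragon House" that uses "food" as a generic word.
review = "the food at the dragon was cheap and the food came fast"
scores = {name: tfidf_score(review, name) for name in restaurants}
best = max(scores, key=scores.get)
print(best, scores)
```

In this toy collection “food” appears in only one name, so its IDF is as high as that of “dragon”. Because the review uses it twice as an ordinary word, the restaurant named “Food” outranks the restaurant the review is actually about.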
Introduction
Their Contributions
The intuition behind their model is simple and natural
– When a review is written about an object, each word in the review is drawn either from a description of the object or from a generic review language that is independent of the object
[Diagram: each word of the review (word 1, …, word n) is drawn either from the generic review language (object-independent) or from the description of the object (object-dependent)]
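The intuition above can be read as a generative process. The following Python sketch simulates it under simplifying assumptions: both word sources are sampled uniformly, and the mixing weight (called `alpha` here) and the word lists are hypothetical.

```python
import random

def sample_review(object_words, generic_words, alpha, length, rng):
    """Draw `length` words; each comes from the object description with
    probability alpha, otherwise from the generic review vocabulary."""
    words = []
    for _ in range(length):
        source = object_words if rng.random() < alpha else generic_words
        words.append(rng.choice(source))
    return words

rng = random.Random(0)
object_words = ["casa", "bonita", "mission", "street"]  # hypothetical text(e)
generic_words = ["great", "food", "service", "the", "was", "tasty"]
review = sample_review(object_words, generic_words, alpha=0.2, length=12, rng=rng)
print(review)
```

Matching a review to an object then amounts to asking which object's description best explains the non-generic words of the review.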
Related Work
Opinion topic identification
– some prior work on fine-grained opinion extraction from reviews
– focused on identifying product features of the object under review, rather than the object itself
Language modeling
– postulate a model for each document
– select the document whose model is most likely to have generated the given query
Entity matching
– consider pairwise attribute similarities between entities
– exploit the relationships that exist between entities
Model and Method
r : a review
R : a collection of reviews
e : an object, which has a set of attributes
E : a set of objects
text(e) : the union of the textual content of all attributes of e
P(w) : the probability that the word w is chosen according to some object-independent distribution
P_e(w) : the probability that the word w is chosen according to some object-dependent distribution
Model and Method
Review Language Model (RLM)
It represents the probability of a review r given that r is written about object e
alpha is a mixing parameter (0 < alpha < 1)
Modeling
– P_e(w) is object-dependent
– P(w) is object-independent (generic review language)
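A sketch of the mixture this slide describes, written in LaTeX. It assumes that P(w) denotes the object-independent word probability, that P_e(w) denotes the object-dependent one, that the words of a review are drawn independently, and that alpha weights the object-dependent part. The notation is an assumption chosen to match the definitions above.

```latex
\Pr[r \mid e] \;=\; \prod_{w \in r} \Bigl( (1-\alpha)\, P(w) \;+\; \alpha\, P_e(w) \Bigr),
\qquad 0 < \alpha < 1
```

Each factor says that a word comes from the object's description with probability alpha and from the generic review language otherwise.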
Model and Method
Review Language Model (RLM)
P_e(w) is zero if the word w is not in text(e)
Thus, the equation can be rewritten as follows
Model and Method
Review Language Model (RLM)
By assuming a uniform distribution for Pr[e], we get
How?
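A worked derivation sketch for this step, in LaTeX. It assumes the mixture form Pr[r | e] = prod over w in r of ((1 - alpha) P(w) + alpha P_e(w)), where P(w) is the object-independent and P_e(w) the object-dependent word probability, with P_e(w) = 0 for w outside text(e) and P(w) > 0 for every word of r. The symbol names are assumptions matching the slide's definitions.

```latex
\begin{align*}
e^{*} &= \arg\max_{e} \Pr[e \mid r]
       = \arg\max_{e} \Pr[e]\,\Pr[r \mid e] \\
      &= \arg\max_{e} \Pr[e] \prod_{w \in r} \bigl( (1-\alpha) P(w) + \alpha P_e(w) \bigr) \\
      &= \arg\max_{e} \Pr[e] \prod_{w \in r} (1-\alpha) P(w)
         \prod_{w \in r \cap \mathrm{text}(e)}
         \Bigl( 1 + \frac{\alpha}{1-\alpha} \cdot \frac{P_e(w)}{P(w)} \Bigr) \\
      &= \arg\max_{e} \sum_{w \in r \cap \mathrm{text}(e)}
         \log \Bigl( 1 + \frac{\alpha}{1-\alpha} \cdot \frac{P_e(w)}{P(w)} \Bigr)
\end{align*}
```

The third line factors (1 - alpha) P(w) out of every term; words outside text(e) contribute a factor of 1 because P_e(w) = 0 there. The last line drops the first product, which does not depend on e, drops the uniform prior Pr[e], and takes the logarithm, which preserves the arg max.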
Model and Method
Review Language Model (RLM)
Object-independent factor P(w)
– Build R^(g), the set of processed reviews in which, for each review-object pair (r, e), the words in text(e) are removed from r; R^(g) serves as an approximation of the generic review language
– Then P(w) can be computed from R^(g) in the aforementioned manner
Object-dependent factor P_e(w)
– Estimated using the frequency f_w of the word w in R^(g) or in R
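A rough Python sketch of the two estimation steps, under stated assumptions. The object-independent distribution is taken to be the relative word frequency in the processed reviews. The slide gives no functional form for the object-dependent distribution, so the inverse-log weighting in `estimate_object_distribution` is purely illustrative, and all function names and data are hypothetical.

```python
import math
from collections import Counter

def build_generic_reviews(pairs):
    """For each (review_tokens, object_tokens) pair, drop the review words
    that appear in text(e); the result approximates generic review text."""
    generic = []
    for review_tokens, object_tokens in pairs:
        object_vocab = set(object_tokens)
        generic.append([w for w in review_tokens if w not in object_vocab])
    return generic

def estimate_generic_distribution(generic_reviews):
    """Relative frequency of each word across the processed reviews."""
    counts = Counter(w for review in generic_reviews for w in review)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}, counts

def estimate_object_distribution(object_tokens, counts):
    """Hypothetical P_e(w): words of text(e) that are rare in reviews get
    more mass. The 1 / (1 + log(1 + f_w)) form is an illustrative assumption."""
    vocab = set(object_tokens)
    weights = {w: 1.0 / (1.0 + math.log(1 + counts.get(w, 0))) for w in vocab}
    norm = sum(weights.values())
    return {w: weight / norm for w, weight in weights.items()}

pairs = [
    (["the", "food", "at", "casa", "bonita", "was", "great"], ["casa", "bonita"]),
    (["great", "service", "and", "great", "food"], ["blue", "door"]),
]
generic = build_generic_reviews(pairs)
p_generic, counts = estimate_generic_distribution(generic)
p_object = estimate_object_distribution(["food", "casa"], counts)
print(p_generic)
print(p_object)
```

With this weighting, a name word such as “casa” that never occurs in the processed reviews receives more probability mass than a name word such as “food” that is common in generic review text.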
Model and Method
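RLM, TFIDF and TFIDF+
Generic equation: all three methods score an object e through a per-word weight f(w)
– for RLM, f(w) takes the form given by the review language model
– for TFIDF and TFIDF+, f(w) takes a TF-IDF form, with the object as the document and the review as the query
– TFIDF+ is a variant of TFIDF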
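A minimal Python sketch of scoring with a per-word weight. It assumes the score of an object is the sum of f(w) over review words that also occur in text(e), and that the RLM weight is f(w) = log(1 + alpha / (1 - alpha) * P_e(w) / P(w)), the form that follows from a two-component mixture with a uniform prior over objects. The probability tables, the `floor` smoothing constant, and the function names are hypothetical.

```python
import math

def rlm_weight(word, p_object, p_generic, alpha, floor):
    # f(w) = log(1 + alpha / (1 - alpha) * P_e(w) / P(w)); `floor` keeps P(w) > 0.
    p_w = max(p_generic.get(word, 0.0), floor)
    return math.log(1.0 + (alpha / (1.0 - alpha)) * p_object[word] / p_w)

def score(review_tokens, p_object, p_generic, alpha=0.5, floor=1e-6):
    # Sum f(w) over the review words that also occur in text(e).
    return sum(
        rlm_weight(w, p_object, p_generic, alpha, floor)
        for w in review_tokens
        if w in p_object
    )

def best_match(review_tokens, objects, p_generic, alpha=0.5, floor=1e-6):
    # `objects` maps an object id to its P_e dictionary.
    return max(
        objects,
        key=lambda e: score(review_tokens, objects[e], p_generic, alpha, floor),
    )

p_generic = {"the": 0.1, "food": 0.05, "great": 0.04, "casa": 0.0001}
objects = {
    "Food": {"food": 1.0},
    "Casa Bonita": {"casa": 0.5, "bonita": 0.5},
}
review = "the food at casa bonita was great food".split()
winner = best_match(review, objects, p_generic)
print(winner)
```

Because “food” is common in generic review language, its weight stays modest even though it is the entire name of one restaurant, while the rare words “casa” and “bonita” dominate the score.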
Data
299,762 reviews
– each aligned with one of 12,408 unique restaurants hosted on Yelp (yelp.com)
– no more than 40 reviews per restaurant
681,320 restaurants from the Yahoo! Local database
Task
– to match a given Yelp review to its restaurant in the Yahoo! Local database, using only the review's free-form textual content
Data
The Final Aligned Dataset R
– 24,910 Yelp reviews covering 6,010 restaurants
R′: used to estimate the models
– reviews filtered out for lack of identifying information were added
– 205,447 reviews
Rtest: used to evaluate RLM
– 11,217 reviews
There are no overlapping restaurants between the estimation and evaluation sets
Evaluation
Unlike a standard IR task
– the goal is not to retrieve multiple relevant objects
– each review in the dataset has exactly one correct match in E
Macro vs. micro average
– Macro average: first compute the average accuracy over the reviews of each restaurant, then report the average over all restaurants
– Micro average: take the average accuracy over all reviews
Accuracy@k
– a review counts as correctly matched if one of the top-k objects returned is the correct match
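The metrics above can be sketched in a few lines of Python. The input format (one `(true_object, ranked_objects)` pair per review) and the toy data are assumptions made for illustration.

```python
from collections import defaultdict

def accuracy_at_k(results, k):
    """results: list of (true_object, ranked_objects) pairs, one per review.
    Returns (micro_average, macro_average) of accuracy@k."""
    hits_by_object = defaultdict(list)
    for true_object, ranked in results:
        hits_by_object[true_object].append(1.0 if true_object in ranked[:k] else 0.0)
    # Micro: average over all reviews.
    all_hits = [h for hits in hits_by_object.values() for h in hits]
    micro = sum(all_hits) / len(all_hits)
    # Macro: average per restaurant first, then across restaurants.
    per_object = [sum(hits) / len(hits) for hits in hits_by_object.values()]
    macro = sum(per_object) / len(per_object)
    return micro, macro

results = [
    ("A", ["A", "B"]),
    ("A", ["B", "A"]),
    ("A", ["B", "C"]),
    ("B", ["B", "A"]),
]
micro_1, macro_1 = accuracy_at_k(results, k=1)
micro_2, macro_2 = accuracy_at_k(results, k=2)
print(micro_1, macro_1, micro_2, macro_2)
```

The macro average gives every restaurant equal weight regardless of how many reviews it has, so the two averages diverge when heavily reviewed restaurants are easier or harder to match than the rest.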
Evaluation
Main Result
Longer reviews might be more difficult to match since they may include more proper nouns such as dish names and related restaurants, and yield a longer list of highly competitive candidate objects.
Evaluation
Main Result
Choices for RLM
– RLM-Uniform
– RLM-Uncut
– RLM-Decap
Revisiting TFIDF+
– Object Length Normalization
– Dampening
– Removing mentions of objects
Using term counts
– each of the other modeling decisions incorporated in RLM is important
Conclusions
The model provides a principled way to match reviews to objects
Their techniques vastly outperform standard TF-IDF-based techniques