
Matching Reviews to Objects using a Language Model

Nilesh Dalvi, Ravi Kumar, Bo Pang, Andrew Tomkins

Yahoo! Research

ACL and AFNLP, 2009

15 April, 2014

Jaehwan Lee

2 / 23

Outline

Introduction

Related Work

Model and Method

Data

Evaluation

Conclusions

3 / 23

Introduction

The search engine would like

– to offer a high quality result set for even obscure restaurants

– to enable advanced applications and recommendation

To achieve these goals, it faces two high-level challenges

– identify the restaurant review pages on the Web

– identify the restaurant that is being reviewed

Note

– restaurant reviews are the running example

– “the techniques are general”

5 / 23

Introduction

Two Settings of Related Flavor

Entity Matching

– to find the correspondence between two structured objects

Information Retrieval (IR)

– to match unstructured short text against unstructured text

6 / 23

Introduction

Classical IR Methods Don't Fit

Example of “Food”

– “food” is rare as a restaurant name

– thus, it will get a very high IDF score

– AND hence will likely be the top match for all reviews containing the word “food”
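
The "food" effect can be reproduced on a toy example. The sketch below is only an illustration: the restaurant names and the review are made up, and the scoring is a plain, unnormalized sum of tf × idf over shared words, which is an assumption and not necessarily the TF-IDF variant used in the paper.

```python
import math
from collections import Counter

# Hypothetical toy "documents": in this setting the restaurant names are the documents.
restaurant_names = [
    "food",
    "golden dragon",
    "luigi pizza",
    "dragon palace",
    "pizza corner",
]

def idf_table(docs):
    """Inverse document frequency: idf(w) = log(N / df(w))."""
    n = len(docs)
    df = Counter(w for d in docs for w in set(d.split()))
    return {w: math.log(n / df[w]) for w in df}

def tfidf_score(review, name, idf):
    """Sum of tf * idf over the words shared by the review and the name."""
    tf = Counter(review.split())
    return sum(tf[w] * idf[w] for w in set(name.split()) if w in tf)

idf = idf_table(restaurant_names)

# The review is about "golden dragon" but, like most reviews, mentions "food".
review = "the food at golden dragon was great and the food came fast"
ranked = sorted(restaurant_names,
                key=lambda name: tfidf_score(review, name, idf),
                reverse=True)
print(ranked[0])  # the restaurant called "food" outranks "golden dragon"
```

Because "food" occurs in only one restaurant name, its IDF is as high as that of any rare name word, and its repeated use in the review pushes the wrong restaurant to the top.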

UNLIKE in traditional IR

– a query (i.e. review) is long and a document (i.e. restaurant) is short

traditional IR: query, document

matching reviews: review (as query), restaurant (as document)

7 / 23

Introduction

Their Contributions

The intuition behind their model is simple and natural

– When a review is written about an object,

– each word in the review is drawn either from a description of the object or from a generic review language that is independent of the object

[Figure: each word (Word 1 … Word n) of a review is drawn either from the generic review language (object-independent) or from the description of the object (object-dependent)]

8 / 23

Related Work

Opinion topic identification

– Some work on fine-grained opinion extraction from reviews

– focused on identifying product features of the object under review, rather than the object itself

Language modeling

– to postulate a model for each document

– to select the document whose model is most likely to have generated the given query

Entity matching

– consider pairwise attribute similarities between entities

– exploit the relationships that exist between entities

9 / 23

Model and Method

r : a review

R : a collection of reviews

e : an object, which has a set of attributes

E : a set of objects

text(e) : the union of the textual content of all attributes of e

P(w) : the probability that the word w is chosen according to some object-independent distribution

P_e(w) : the probability that the word w is chosen according to some object-dependent distribution

10 / 23

Model and Method

Review Language Model (RLM)

It represents the probability that a review r is a review about object e when e exists in r

α is a parameter (0 < α < 1)

Modeling

– P_e(w) is object-dependent

– P(w) is object-independent (generic review feature)
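
The formula on this slide did not survive extraction. A reconstruction consistent with the bullets above, offered as a sketch: the notation P(w) for the object-independent distribution, P_e(w) for the object-dependent one, and Z_e(w) for their mixture is assumed here.

```latex
\begin{align*}
Z_e(w) &= (1-\alpha)\,P(w) + \alpha\,P_e(w), \qquad 0 < \alpha < 1 \\
\Pr[r \mid e] &= \prod_{w \in r} Z_e(w)
\end{align*}
```

Under this reading, each word of the review is drawn from the object description with probability α and from the generic review language otherwise.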

11 / 23

Model and Method


It can be zero if a word w is not in text(e)

Thus, the equation has to be modified as follows

12 / 23

Model and Method


By assuming a uniform distribution for Pr[e], we get

How?
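
One way to answer this, sketched under two assumptions: the likelihood of a review is the product of per-word mixture probabilities (1 − α)P(w) + αP_e(w), and P_e(w) = 0 for every w outside text(e). The notation is assumed, not copied from the slide.

```latex
\begin{align*}
e^* &= \arg\max_e \Pr[e \mid r] = \arg\max_e \Pr[e]\,\Pr[r \mid e] \\
    &= \arg\max_e \prod_{w \in r} \bigl((1-\alpha)P(w) + \alpha P_e(w)\bigr)
       && \text{(uniform } \Pr[e]\text{)} \\
    &= \arg\max_e \prod_{w \in r} \Bigl(1 + \frac{\alpha}{1-\alpha}\,\frac{P_e(w)}{P(w)}\Bigr)
       && \text{(divide by } \textstyle\prod_{w \in r}(1-\alpha)P(w)\text{, which does not depend on } e\text{)} \\
    &= \arg\max_e \sum_{w \in r \cap \mathrm{text}(e)} \log\Bigl(1 + \frac{\alpha}{1-\alpha}\,\frac{P_e(w)}{P(w)}\Bigr)
\end{align*}
```

The last step takes logarithms and drops every word outside text(e), since P_e(w) = 0 there and the corresponding term is log 1 = 0. It also assumes P(w) > 0 for every word of the review.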

14 / 23

Model and Method


Object-independent factor

– Treat R(g), the set of processed reviews in which the words in text(e) are removed from r for each review-object pair (r, e), as an approximation of the generic review language

– Then, the object-independent probability of each word can be computed in the aforementioned manner

Object-dependent factor

– By using the frequency f_w of the word w in R(g) or in R
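
A minimal Python sketch of how these estimates could feed a matcher. Several parts are assumptions made for illustration and are not taken from the slides: add-one smoothing for the generic model, a uniform object-dependent distribution over text(e), a floor probability for words never seen during estimation, α = 0.5, and the toy restaurants and reviews. The score is the log-form sum over the words shared by the review and text(e).

```python
import math
from collections import Counter

def estimate_generic(pairs, text_of):
    """Estimate the generic model P(w) from reviews with text(e) words removed.

    pairs: list of (review_tokens, object_id); text_of: object_id -> set of words.
    Add-one smoothing over the observed vocabulary is an assumption of this sketch.
    """
    counts, vocab = Counter(), set()
    for tokens, e in pairs:
        vocab.update(tokens)
        counts.update(w for w in tokens if w not in text_of[e])
    total = sum(counts.values()) + len(vocab)
    return {w: (counts[w] + 1) / total for w in vocab}

def rlm_score(tokens, e, text_of, p_generic, alpha=0.5):
    """Sum over w in r ∩ text(e) of log(1 + alpha/(1-alpha) * P_e(w) / P(w))."""
    words = text_of[e]
    p_e = 1.0 / len(words)           # uniform P_e(w): a simplifying assumption
    floor = min(p_generic.values())  # assumed fallback for unseen words
    ratio = alpha / (1.0 - alpha)
    return sum(math.log(1.0 + ratio * p_e / p_generic.get(w, floor))
               for w in tokens if w in words)

def best_match(tokens, text_of, p_generic, alpha=0.5):
    """Return the object with the highest score for this review."""
    return max(text_of, key=lambda e: rlm_score(tokens, e, text_of, p_generic, alpha))

# Hypothetical toy data.
text_of = {
    "food": {"food"},
    "golden_dragon": {"golden", "dragon"},
    "luigi_pizza": {"luigi", "pizza"},
}
train = [
    ("the food at luigi pizza was great".split(), "luigi_pizza"),
    ("good food and cheap pizza at luigi".split(), "luigi_pizza"),
]
p_generic = estimate_generic(train, text_of)
review = "the food at golden dragon was great".split()
print(best_match(review, text_of, p_generic))
```

In this toy setting "food" is frequent in the generic model, so matching it contributes little, while the rarer words "golden" and "dragon" contribute more, and the review is assigned to golden_dragon.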

15 / 23

Model and Method

RLM, TFIDF and TFIDF+

Generic equation

for RLM, f(w) goes

for TFIDF and TFIDF+, f(w) goes document query

TFIDF+

16 / 23

Data

299,762 reviews

– each aligned with one of a set of 12,408 unique restaurants hosted on Yelp (yelp.com)

– no more than 40 reviews per restaurant

681,320 restaurants from Yahoo! Local database

Task

– to match a given Yelp review to its restaurant in the Yahoo! Local database, using ONLY its free-form textual content

17 / 23

Data

The Final Aligned Dataset

– R: 24,910 Yelp reviews covering 6,010 restaurants

– R′: used to estimate the models; reviews filtered out because of a lack of identifying information were added; 205,447 reviews

– Rtest: used to evaluate RLM; 11,217 reviews

There are no overlapping restaurants between them

18 / 23

Evaluation

Unlike a standard IR task

– not interested in retrieving multiple relevant objects

– each review in the dataset has exactly one correct match from E

Macro vs. micro average

– Macro average

first, compute the average accuracy over the reviews about the same restaurant

then report the average over all restaurants

– Micro average

take the average accuracy over all reviews

Accuracy @ k

– a review is considered correctly matched if one of the top-k objects returned is the correct match
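
The metrics above can be sketched in a few lines of Python. The function name, the input layout (one pair of true object and ranked candidates per review), and the toy results are assumptions made for illustration.

```python
from collections import defaultdict

def accuracy_at_k(results, k):
    """Return (micro, macro) accuracy@k.

    results: list of (true_object, ranked_candidates), one entry per review.
    """
    # A review is a hit if its true object is among the top-k candidates.
    hits = [(truth, truth in ranked[:k]) for truth, ranked in results]

    # Micro average: every review counts equally.
    micro = sum(hit for _, hit in hits) / len(hits)

    # Macro average: average per restaurant first, then over restaurants.
    per_object = defaultdict(list)
    for truth, hit in hits:
        per_object[truth].append(hit)
    macro = sum(sum(v) / len(v) for v in per_object.values()) / len(per_object)
    return micro, macro

# Hypothetical rankings: three reviews of restaurant "a", one of restaurant "b".
results = [
    ("a", ["a", "b", "c"]),
    ("a", ["b", "a", "c"]),
    ("a", ["a", "c", "b"]),
    ("b", ["c", "a", "b"]),
]
micro1, macro1 = accuracy_at_k(results, 1)
print(micro1, macro1)
```

With k = 1 the micro average is 2/4 while the macro average is (2/3 + 0)/2, which shows how a restaurant with few reviews weighs more heavily in the macro figure.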

19 / 23

Evaluation

Main Result

21 / 23

Evaluation

Main Result

Longer reviews might be more difficult to match since they may include more proper nouns such as dish names and related restaurants, and yield a longer list of highly competitive candidate objects.

22 / 23

Evaluation

Main Result

Choices for RLM

– RLM-Uniform

– RLM-Uncut

– RLM-Decap

Revisiting TFIDF+

– Object Length Normalization

– Dampening

– Removing mentions of objects

Using term counts

– each of the other modeling decisions incorporated in RLM is important

23 / 23

Conclusions

The model provides a principled way to match reviews to objects

Their techniques vastly outperform standard TF-IDF-based techniques
