Integrating Word Relationships into Language Models

Guihong Cao, Jian-Yun Nie, Jing Bai
Département d'Informatique et de Recherche Opérationnelle, Université de Montréal
Presenter: Chia-Hao Lee


Page 1: Integrating Word Relationships into Language Models

Integrating Word Relationships into Language Models

Guihong Cao, Jian-Yun Nie, Jing Bai
Département d'Informatique et de Recherche Opérationnelle, Université de Montréal

Presenter: Chia-Hao Lee

Page 2: Integrating Word Relationships into Language Models

Outline

• Introduction

• Previous Work

• A Dependency Model to Combine WordNet and Co-occurrence

• Parameter estimation
  – Estimating conditional probabilities
  – Estimating mixture weights

• Experiments

• Conclusion and future work

Page 3: Integrating Word Relationships into Language Models

Introduction

• In recent years, language models for information retrieval (IR) have increased in popularity.

• The basic idea is to compute the conditional probability P(Q|D).

• In most approaches, the computation is conceptually decomposed into two distinct steps:
  – (1) Estimating the document model
  – (2) Computing the query likelihood using the estimated document model

Page 4: Integrating Word Relationships into Language Models

• When estimating the document model, the words in the document are assumed to be independent with respect to one another, leading to the so called “bag-of-word” model.

• However, from our own knowledge of natural language, we know that the assumption of term independence is a matter of mathematical convenience rather than a reality.

• For example, the words “computer” and “program” are not independent. A query asking for “computer” might well be satisfied by a document about “program”.

Introduction (cont.)

Page 5: Integrating Word Relationships into Language Models

• Some studies have been carried out to relax the independence assumption.

• The first direction is data-driven: it tries to capture dependencies among terms using statistical information derived directly from the corpus.

• Another direction is to exploit hand-crafted thesauri, such as WordNet.

Introduction (cont.)

Page 6: Integrating Word Relationships into Language Models

Previous Work

• In the classical language modeling approach to IR, a multinomial model over terms, P(w|d), is estimated for each document in the collection to be indexed and searched.

• In most cases, each query term is assumed to be independent of the others, so the query likelihood is estimated by $P(q|d) = \prod_{i=1}^{n} P(q_i|d)$.

• After the specification of a document prior P(d), the posterior probability of a document is given by:

$$P(d|q) \propto P(q|d)\,P(d) \quad (1)$$
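To make the classical model concrete, here is a minimal Python sketch (not from the paper) that ranks documents by the smoothed unigram query likelihood with a uniform document prior. The function names, data structures and the Dirichlet smoothing parameter `mu` are illustrative assumptions; the paper's own smoothing for the unigram model (Equation 12 below) uses absolute discounting instead.

```python
import math
from collections import Counter

def unigram_query_likelihood(query_terms, doc_terms, collection_counts,
                             collection_size, mu=2000.0):
    """log P(q|d) under a unigram document model with Dirichlet smoothing."""
    doc_counts = Counter(doc_terms)
    doc_len = len(doc_terms)
    log_p = 0.0
    for q in query_terms:
        p_coll = collection_counts.get(q, 0) / collection_size   # P_MLE(q|C)
        p_q = (doc_counts.get(q, 0) + mu * p_coll) / (doc_len + mu)
        log_p += math.log(max(p_q, 1e-12))                       # guard against zero
    return log_p

def rank_documents(query_terms, docs, collection_counts, collection_size):
    """Score documents by log P(d|q) = log P(q|d) + log P(d), with a uniform prior P(d)."""
    log_prior = -math.log(len(docs))
    scores = {doc_id: log_prior + unigram_query_likelihood(
                  query_terms, terms, collection_counts, collection_size)
              for doc_id, terms in docs.items()}
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)
```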

Page 7: Integrating Word Relationships into Language Models

• However, the classical language modeling approach to IR does not address the problem of dependence between words.

• The term “dependence” may mean two different things:
  – Dependence between words within a query or within a document
  – Dependence between query words and document words

• Under the first meaning, one may try to recognize the relationships between words within a sentence.

• Under the second meaning, dependence means any relationship that can be exploited during query evaluation.

Previous Work (cont.)

Page 8: Integrating Word Relationships into Language Models

• To incorporate term relationships into the document language model, a translation model t(q_i|w) is proposed.

• With the translation model, the document-to-query model becomes:

$$P(q|d) = \prod_{i=1}^{n} \sum_{w} t(q_i|w)\,P(w|d) \quad (2)$$

• Even though this model is more general than other language models, it is difficult to determine the translation probability t(q_i|w) in practice.

• To solve this problem, we generate an artificial collection of “synthetic” data for training by assuming that a sentence is parallel to the paragraph that contains the sentence.

Previous Work (cont.)
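The following short Python sketch illustrates the translation-model likelihood of Equation 2. It is not from the paper; the dictionary layouts (`translation_probs` keyed by `(q_i, w)` pairs, `p_w_given_d` keyed by document words) are illustrative assumptions.

```python
import math

def translation_query_likelihood(query_terms, translation_probs, p_w_given_d):
    """log P(q|d) under Equation 2: P(q|d) = prod_i sum_w t(q_i|w) P(w|d)."""
    log_p = 0.0
    for qi in query_terms:
        p_qi = sum(translation_probs.get((qi, w), 0.0) * p_w
                   for w, p_w in p_w_given_d.items())
        log_p += math.log(max(p_qi, 1e-12))   # guard against zero probability
    return log_p
```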

Page 9: Integrating Word Relationships into Language Models

A Dependency Model to Combine WordNet and Co-occurrence

• Given a query q and a document d, the query and the document can be related directly, or they can be related indirectly through some word relationships.

• An example of the first case is that the document and the query contain the same words.

• In the second case, a document can contain a different word that is synonymous with or related to the one in the query.

Page 10: Integrating Word Relationships into Language Models

• In order to take both cases into our modeling, we assume that there are two sources for generating a query term from a document: a dependency model and a non-dependency model.

θ_D : the parameter of the dependency model

θ_D̄ : the parameter of the non-dependency model

A Dependency Model to Combine WordNet and Co-occurrence (cont.)

$$P(q|d) = \prod_{i=1}^{n} P(q_i|d) = \prod_{i=1}^{n}\left[P(q_i,\theta_D|d) + P(q_i,\theta_{\bar D}|d)\right] = \prod_{i=1}^{n}\left[P(q_i|d,\theta_D)\,P(\theta_D|d) + P(q_i|d,\theta_{\bar D})\,P(\theta_{\bar D}|d)\right] \quad (3)$$

Page 11: Integrating Word Relationships into Language Models

• The non-dependency model tries to capture the direct generation of the query by the document; we can model it by the unigram document model:

$$P(q_i|d,\theta_{\bar D})\,P(\theta_{\bar D}|d) = P_U(q_i|d)\,P(U|d) \quad (4)$$

P_U(q_i|d) : the probability under the unigram model

• For the dependency model, we first select a term w in the document randomly.

• Second, a query term is generated based on the observed term. Therefore we have:

$$P(q_i|d,\theta_D) = \sum_{w\in d} P(q_i|w)\,P(w|d,\theta_D)$$

A Dependency Model to Combine WordNet and Co-occurrence (cont.)
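A minimal sketch of the per-term mixture of Equations 3 and 4, assuming dictionary inputs and a callable unigram model; the names `p_dep` (for P(θ_D|d)) and `p_qi_given_w` are illustrative.

```python
def term_score(qi, p_w_given_d, p_qi_given_w, p_unigram, p_dep):
    """P(q_i|d): mixture of a dependency model (query term generated via a
    document word) and a non-dependency (unigram) model."""
    # sum_w P(q_i|w) P(w|d, theta_D)
    dep_part = sum(p_qi_given_w.get((qi, w), 0.0) * p_w
                   for w, p_w in p_w_given_d.items())
    return p_dep * dep_part + (1.0 - p_dep) * p_unigram(qi)
```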

Page 12: Integrating Word Relationships into Language Models

• As for the translation model, we also have the problem of estimating the dependency between two terms, i.e. P(q_i|w).

• To address the problem, we assume that some word relationships have been manually identified and stored in a linguistic resource, and some other relationships have to be found automatically according to co-occurrences.

A Dependency Model to Combine WordNet and Co-occurrence (cont.)


Page 13: Integrating Word Relationships into Language Models

• So, this combination can be achieved by linear interpolation smoothing. Thus:

$$P(q_i|w) = \lambda\,P(q_i|w, L) + (1-\lambda)\,P(q_i|w, \bar L) \quad (5)$$

P(q_i|w, L) : the conditional probability of q_i given w according to WordNet
P(q_i|w, L̄) : the probability that the link between q_i and w is established by other means
λ : the interpolation factor; Equation 5 can thus be considered a two-component mixture model

• In our study, we only consider co-occurrence information besides WordNet.

• So, P(q_i|w, L̄) is just the co-occurrence model.

A Dependency Model to Combine WordNet and Co-occurrence (cont.)
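A one-line Python sketch of the interpolation in Equation 5; the dictionaries `p_link` and `p_cooc` (keyed by `(q_i, w)` pairs) and the default value of `lam` are illustrative assumptions.

```python
def p_qi_given_w(qi, w, p_link, p_cooc, lam=0.5):
    """Equation 5: P(q_i|w) = lambda * P(q_i|w, L) + (1 - lambda) * P(q_i|w, not L)."""
    return lam * p_link.get((qi, w), 0.0) + (1.0 - lam) * p_cooc.get((qi, w), 0.0)
```

In practice `lam` would be tuned or learned rather than fixed at 0.5.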

Page 14: Integrating Word Relationships into Language Models

• For simplicity of expression, we denote the link model as P_L(q_i|w) = P(q_i|w, L) and the co-occurrence model as P_CO(q_i|w) = P(q_i|w, L̄).

• Substituting Equations 4 and 5 into Equation 3, we obtain Equation 6:

$$
\begin{aligned}
P(q|d) &= \prod_{i=1}^{n}\left[P(\theta_D|d)\,P(q_i|d,\theta_D) + P(U|d)\,P_U(q_i|d)\right] \\
&= \prod_{i=1}^{n}\left[P(\theta_D|d)\sum_{w\in d} P(q_i|w)\,P(w|d,\theta_D) + P(U|d)\,P_U(q_i|d)\right] \\
&= \prod_{i=1}^{n}\Big[\lambda\,P(\theta_D|d)\sum_{w\in d} P_L(q_i|w)\,P(w|d,\theta_D) + (1-\lambda)\,P(\theta_D|d)\sum_{w\in d} P_{CO}(q_i|w)\,P(w|d,\theta_D) + P(U|d)\,P_U(q_i|d)\Big] \quad (6)
\end{aligned}
$$

using $\sum_{w\in d} P(q_i|w)P(w|d,\theta_D) = \lambda\sum_{w\in d} P_L(q_i|w)P(w|d,\theta_D) + (1-\lambda)\sum_{w\in d} P_{CO}(q_i|w)P(w|d,\theta_D)$.

A Dependency Model to Combine WordNet and Co-occurrence (cont.)

Page 15: Integrating Word Relationships into Language Models

A Dependency Model to Combine WordNet and Co-occurrence (cont.)


Page 16: Integrating Word Relationships into Language Models

• The idea becomes clearer if we make some simplifications in the formula.

• So, we can get:

A Dependency Model to Combine WordNet and Co-occurrence (cont.)

$$P_L(q_i|d) = \sum_{w\in d} P_L(q_i|w)\,P(w|d,\theta_D) \quad (7)$$

$$P_{CO}(q_i|d) = \sum_{w\in d} P_{CO}(q_i|w)\,P(w|d,\theta_D) \quad (8)$$

$$P(q|d) = \prod_{i=1}^{n}\left[P(\theta_D|d)\,\lambda\,P_L(q_i|d) + P(\theta_D|d)\,(1-\lambda)\,P_{CO}(q_i|d) + P(U|d)\,P_U(q_i|d)\right] \quad (9)$$

Equation 9 consists of a link model, a co-occurrence model and a unigram model.

Page 17: Integrating Word Relationships into Language Models

• Let λ_L, λ_CO, λ_U denote the respective weights of the link model, the co-occurrence model, and the unigram model, where λ_L = λ P(θ_D|d), λ_CO = (1−λ) P(θ_D|d) and λ_U = P(U|d).

• Then Equation 9 can be rewritten as:

$$P(q|d) = \prod_{i=1}^{n}\left[\lambda_L\,P_L(q_i|d) + \lambda_{CO}\,P_{CO}(q_i|d) + \lambda_U\,P_U(q_i|d)\right] \quad (10)$$

• For information retrieval, the most important terms are nouns. So, we concentrate on three relations related to nouns: synonym, hypernym, and hyponym. Splitting the link model accordingly gives:

A Dependency Model to Combine WordNet and Co-occurrence (cont.)

$$P(q|d) = \prod_{i=1}^{n}\left[\lambda_1 P_{SYN}(q_i|d) + \lambda_2 P_{HYPE}(q_i|d) + \lambda_3 P_{HYPO}(q_i|d) + \lambda_4 P_{CO}(q_i|d) + \lambda_5 P_U(q_i|d)\right] \quad (11)$$

NSLM

SLM
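The final scoring of Equations 10 and 11 is a weighted sum of component probabilities per query term. The sketch below (not from the paper) shows this in Python; the component names, the callable interface and the requirement that the weights sum to 1 are illustrative assumptions.

```python
import math

def nslm_score(query_terms, component_models, weights):
    """Document score under the mixture of Equation 11.
    `component_models` maps a component name ('syn', 'hype', 'hypo', 'co', 'uni')
    to a function returning P_component(q_i|d) for the current document;
    `weights` maps the same names to the corresponding lambdas (summing to 1)."""
    log_p = 0.0
    for qi in query_terms:
        p_qi = sum(weights[name] * model(qi) for name, model in component_models.items())
        log_p += math.log(max(p_qi, 1e-12))
    return log_p
```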

Page 18: Integrating Word Relationships into Language Models

Parameter estimation

• 1. Estimating conditional probabilities
  – For the unigram model P_U(q_i|d), we use the MLE estimate, smoothed by interpolated absolute discounting, that is:

$$P_U(q_i|d) = \frac{\max(c(q_i;d) - \delta,\,0)}{|d|} + \frac{\delta\,|d|_u}{|d|}\,P_{MLE}(q_i|C) \quad (12)$$

δ : the discount factor
|d| : the length of the document d
|d|_u : the count of unique terms in the document d
P_MLE(q_i|C) : the maximum likelihood probability of the word in the collection
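A small Python sketch of the interpolated absolute discounting of Equation 12; the default discount value and the callable `p_mle_collection` are illustrative assumptions.

```python
from collections import Counter

def unigram_abs_discount(qi, doc_terms, p_mle_collection, delta=0.7):
    """Equation 12: interpolated absolute discounting for P_U(q_i|d)."""
    counts = Counter(doc_terms)
    doc_len = len(doc_terms)
    unique = len(counts)                                  # |d|_u, unique terms in d
    discounted = max(counts.get(qi, 0) - delta, 0.0) / doc_len
    backoff = (delta * unique / doc_len) * p_mle_collection(qi)
    return discounted + backoff
```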

Page 19: Integrating Word Relationships into Language Models

• For P(w|d, θ_D), it can be approximated by the maximum likelihood probability P_MLE(w|d).

• This approximation is motivated by the fact that the word w is primarily generated from d in a way quite independent of the model θ_D.

• Next we estimate P_L(w_i|w), the probability of a link between two words according to WordNet.

Parameter estimation (cont.)

Page 20: Integrating Word Relationships into Language Models

• Equation 13 defines our estimation of P_L(w_i|w) by interpolated absolute discounting:

Parameter estimation (cont.)

$$P_L(w_i|w) = \frac{\max(c(w_i, w, LW) - \delta,\,0)}{\sum_{w_j} c(w_j, w, LW)} + \frac{\delta\,c(*, w, LW)}{\sum_{w_j} c(w_j, w, LW)}\,P_{add\text{-}one}(w_i|w, LW) \quad (13)$$

$$P_{add\text{-}one}(w_i|w, LW) = \frac{c(w_i, w, LW) + 1}{\sum_{j=1}^{v} c(w_j, w, LW) + v}$$

w_i and w are assumed to have a relationship in WordNet.

c(*, w, LW) : the number of unique terms which have a relationship with w in WordNet and co-occur with it in the window W
c(w_i, w, LW) : the count of co-occurrences of w_i with w within the predefined window
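The sketch below is a rough illustration (not the paper's implementation) of how the linked co-occurrence counts behind Equation 13 could be collected and turned into P_L(w_i|w). It assumes NLTK's WordNet interface with synonym, hypernym and hyponym links for nouns; the window size, discount value and helper names are illustrative.

```python
from collections import defaultdict
from nltk.corpus import wordnet as wn   # assumes the NLTK WordNet data is installed

def related_in_wordnet(w1, w2):
    """True if w1 and w2 are linked in WordNet as synonyms, hypernyms or hyponyms (nouns)."""
    for s1 in wn.synsets(w1, pos=wn.NOUN):
        neighbours = set(s1.lemma_names())
        for rel in s1.hypernyms() + s1.hyponyms():
            neighbours.update(rel.lemma_names())
        if w2 in neighbours:
            return True
    return False

def link_cooccurrence_counts(corpus_sentences, window=5):
    """Collect c(w_i, w, LW): co-occurrences within a window that are also WordNet-linked."""
    counts = defaultdict(int)
    for tokens in corpus_sentences:
        for pos, w in enumerate(tokens):
            for wi in tokens[pos + 1: pos + 1 + window]:
                if wi != w and related_in_wordnet(wi, w):
                    counts[(wi, w)] += 1
                    counts[(w, wi)] += 1
    return counts

def p_link(wi, w, counts, vocab_size, delta=0.7):
    """Equation 13: interpolated absolute discounting over the linked co-occurrence counts."""
    total = sum(c for (x, y), c in counts.items() if y == w)   # sum_j c(w_j, w, LW)
    if total == 0:
        return 0.0
    unique = sum(1 for (x, y) in counts if y == w)             # c(*, w, LW)
    add_one = (counts.get((wi, w), 0) + 1) / (total + vocab_size)
    return max(counts.get((wi, w), 0) - delta, 0.0) / total + (delta * unique / total) * add_one
```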

Page 21: Integrating Word Relationships into Language Models

• The estimation of the co-occurrence model P_CO(w_i|w) is similar to that of the link model P_L(w_i|w), except that when counting the co-occurrence frequency, the requirement of having a link in WordNet is removed.

Parameter estimation (cont.)

$$P_{CO}(w_i|w) = \frac{\max(c(w_i, w, W) - \delta,\,0)}{\sum_{w_j} c(w_j, w, W)} + \frac{\delta\,c(*, w, W)}{\sum_{w_j} c(w_j, w, W)}\,P_{add\text{-}one}(w_i|w, W) \quad (14)$$

$$P_{add\text{-}one}(w_i|w, W) = \frac{c(w_i, w, W) + 1}{\sum_{j=1}^{v} c(w_j, w, W) + v}$$

Page 22: Integrating Word Relationships into Language Models

• 2. Estimating mixture weights

We introduce an EM algorithm to estimate the mixture weights in NSLM.

Because NSLM is a three-component mixture model, the optimal weights should maximize the likelihood of the queries.

Let Λ_q = (λ_L, λ_CO, λ_U) be the mixture weights; we then have:

Parameter estimation (cont.)

$$\Lambda_q^* = \arg\max_{\Lambda_q} \log \sum_{i=1}^{N} \alpha_i \prod_{j=1}^{m} \left[\lambda_U P_U(q_j|d_i) + \lambda_L P_L(q_j|d_i) + \lambda_{CO} P_{CO}(q_j|d_i)\right] \quad (15)$$

N : the number of documents in the dataset
m : the length of query q
α_i (1 ≤ i ≤ N) : the prior probability with which to choose the document d_i to generate the query

Page 23: Integrating Word Relationships into Language Models

• However, some documents having high weights are not truly relevant to the query. They contain noise.

• To account for the noise, we further assume that there are two distinctive sources to generate the query.

• One is the relevant documents; the other is a noisy source, which is approximated by the collection C.

Parameter estimation (cont.)

P_U(q_j|C), P_L(q_j|C), P_CO(q_j|C) : respectively the unigram model, link model and co-occurrence model built from the collection
γ : the weight of the noise

$$\Lambda_q^* = \arg\max_{\Lambda_q} \log \sum_{i=1}^{N} \alpha_i \prod_{j=1}^{m} \Big[(1-\gamma)\big(\lambda_U P_U(q_j|d_i) + \lambda_L P_L(q_j|d_i) + \lambda_{CO} P_{CO}(q_j|d_i)\big) + \gamma\big(\lambda_U P_U(q_j|C) + \lambda_L P_L(q_j|C) + \lambda_{CO} P_{CO}(q_j|C)\big)\Big] \quad (16)$$

The collection terms play the role of smoothing.

Page 24: Integrating Word Relationships into Language Models

• With this setting, the hidden α_i (1 ≤ i ≤ N) and the weights Λ_q can be estimated using the EM algorithm.

• The update formulas are as follows:

Parameter estimation (cont.)

Writing $A_{ij}^{(r)} = \lambda_U^{(r)} P_U(q_j|d_i) + \lambda_L^{(r)} P_L(q_j|d_i) + \lambda_{CO}^{(r)} P_{CO}(q_j|d_i)$ and $B_j^{(r)} = \lambda_U^{(r)} P_U(q_j|C) + \lambda_L^{(r)} P_L(q_j|C) + \lambda_{CO}^{(r)} P_{CO}(q_j|C)$, the updates at iteration r are:

$$\alpha_i^{(r+1)} = \frac{\alpha_i^{(r)} \prod_{j=1}^{m}\big[(1-\gamma)A_{ij}^{(r)} + \gamma B_j^{(r)}\big]}{\sum_{i'=1}^{N} \alpha_{i'}^{(r)} \prod_{j=1}^{m}\big[(1-\gamma)A_{i'j}^{(r)} + \gamma B_j^{(r)}\big]} \quad (17)$$

$$\lambda_L^{(r+1)} = \frac{1}{m}\sum_{j=1}^{m}\sum_{i=1}^{N} \alpha_i^{(r)}\,\frac{\lambda_L^{(r)}\big[(1-\gamma)P_L(q_j|d_i) + \gamma P_L(q_j|C)\big]}{(1-\gamma)A_{ij}^{(r)} + \gamma B_j^{(r)}}$$

with analogous updates for λ_CO^{(r+1)} and λ_U^{(r+1)}, replacing P_L by P_CO and P_U respectively.
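The sketch below shows a generic EM loop for the weights (λ_U, λ_L, λ_CO) of the noisy mixture in Equation 16. It is an illustrative implementation, not necessarily the exact update of Equation 17: for simplicity the document prior α_i is kept uniform here, whereas the paper also re-estimates it. The matrix layouts of the inputs are assumptions.

```python
def em_mixture_weights(p_u, p_l, p_co, p_u_c, p_l_c, p_co_c, gamma=0.2, iters=50):
    """EM for (lambda_U, lambda_L, lambda_CO).
    p_u[i][j], p_l[i][j], p_co[i][j]: P_U(q_j|d_i), P_L(q_j|d_i), P_CO(q_j|d_i);
    p_u_c[j], p_l_c[j], p_co_c[j]: the corresponding collection probabilities;
    gamma: the noise weight."""
    n_docs, m = len(p_u), len(p_u_c)
    alpha = [1.0 / n_docs] * n_docs              # uniform document prior (simplification)
    lam = {"U": 1 / 3, "L": 1 / 3, "CO": 1 / 3}
    for _ in range(iters):
        # E-step: expected responsibility of each component per (document, query term)
        resp_sum = {"U": 0.0, "L": 0.0, "CO": 0.0}
        for i in range(n_docs):
            for j in range(m):
                comp = {
                    "U": lam["U"] * ((1 - gamma) * p_u[i][j] + gamma * p_u_c[j]),
                    "L": lam["L"] * ((1 - gamma) * p_l[i][j] + gamma * p_l_c[j]),
                    "CO": lam["CO"] * ((1 - gamma) * p_co[i][j] + gamma * p_co_c[j]),
                }
                total = sum(comp.values()) or 1e-12
                for k in resp_sum:
                    resp_sum[k] += alpha[i] * comp[k] / total
        # M-step: renormalise the expected counts into new weights
        z = sum(resp_sum.values()) or 1e-12
        lam = {k: v / z for k, v in resp_sum.items()}
    return lam
```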

Page 25: Integrating Word Relationships into Language Models

Experiments

• We evaluated our model described in the previous sections using three different TREC collections: WSJ, AP and SJM.

Page 26: Integrating Word Relationships into Language Models

Experiments (cont.)

Page 27: Integrating Word Relationships into Language Models

Conclusion and future work

• In this paper, we integrated word relationships into the language modeling framework.

• We used the EM algorithm to train the parameters. This method worked well in our experiments.