
Modeling Variable Dependencies between Characters in Chinese Information Retrieval

Lixin Shi, Jian-Yun Nie

DIRO, University of Montreal

Outline

1. Motivation

2. Related Work

3. Variable Dependency Model

4. Parameter Estimation

5. Experiment and Discussion

6. Conclusion and Future Work


1. Motivation

Two approaches to index Chinese texts:

― Character n-grams (unigram and bigram)
― Segmented words

Traditional approaches assume independence among terms:

― Bag-of-words models

Current approaches typically combine different models with fixed weights.

― e.g. $Score = \lambda \cdot Score_{bigram} + (1 - \lambda) \cdot Score_{unigram}$

In reality, terms are often dependent, and term dependencies do not have equal importance in IR.

― Strong: “hot dog”, “black Monday”
  ― These dependencies should play an important role in IR

― Weak: “computer game”, “text printing”
  ― These dependencies should be considered weakly

Dependencies are even more important to consider in Chinese IR.

― Characters can be strongly dependent: 京 (capital, Beijing) + 九 (nine) → 京九 (北京九龙, Beijing-Kowloon)

― Weak dependency: 房 + 屋 → 房屋 (house)

We try to capture various dependencies in our model and use an SVM to determine their weights.


2. Related Work

Combining Different Indexes

• Previous studies often combine characters, bigrams and words.

• In the language modeling (LM) approach, a general formulation is as follows:

$$Score(Q, D) = \sum_{R} \lambda_R \, Score_R(Q, D)$$

$$Score_R(Q, D) = \sum_{w \in V_R} P(w \mid Q) \log P(w \mid D)$$

where $V_R$ is the vocabulary of type R (R can be U, B, or W) and $\lambda_R$ is a fixed weight.
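The fixed-weight combination can be illustrated with a minimal sketch. The function names, the toy probabilities and the weights below are made up for illustration, and the document models are assumed to be already smoothed so that every query term has a non-zero probability.

```python
import math

def score_r(query_model, doc_model):
    """Score_R(Q, D): sum over terms w of P(w|Q) * log P(w|D) for one index type R."""
    return sum(p_q * math.log(doc_model[w]) for w, p_q in query_model.items())

def combined_score(query_models, doc_models, weights):
    """Score(Q, D): sum over index types R of lambda_R * Score_R(Q, D)."""
    return sum(weights[r] * score_r(query_models[r], doc_models[r]) for r in weights)

# Toy, made-up language models for the unigram (U) and bigram (B) indexes.
query_models = {"U": {"京": 0.5, "九": 0.5}, "B": {"京九": 1.0}}
doc_models = {"U": {"京": 0.02, "九": 0.01}, "B": {"京九": 0.005}}
weights = {"U": 0.7, "B": 0.3}  # hypothetical fixed lambda_R values

score = combined_score(query_models, doc_models, weights)
print(score)
```

The same weights apply to every query term here, which is exactly the limitation the variable dependency model is meant to address.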

Related Work in English

Combining the unigram model with bigram and biterm models

Markov Random Fields (MRF): an undirected graphical model that captures the dependencies of terms within the same clique (fully connected nodes)

― MRF-FD (Full Model): assumes that all terms are connected to each other.
  This leads to a complexity problem for large cliques.

― MRF-SD (Sequential Model): considers that only adjacent terms are connected.
  ◦ Uses fixed weights for the combination: λT (unigram), λO (ordered bigram), λU (unordered bigram)

Weighted SD Model (WSD): a recent extension of MRF that allows the weights λO and λU to differ depending on individual term pairs

◦ Limitations
― Considers adjacent terms only
― Ordered and unordered term pairs use the same weight (i.e. λO = λU)

3. A New Variable Dependency Model

Discriminative Models

The model is defined within the framework of discriminative models.

This allows us to selectively consider dependencies between more distant characters, without having to increase the complexity to account for less useful dependencies.

The discriminative function can be a posterior probability or simply a confidence score.

A typical discriminative model is:

$$P(Rel \mid Q, D) = \frac{1}{Z} \exp\left( \sum_{i=1}^{n} \lambda_i f_i(Q, D) \right)$$

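Our Model

We integrate 3 types of features:

― Unigram: $f_U(q_i, D)$
― Ordered bigrams: $f_B(q_i q_{i+1}, D)$
― Unordered co-occurrence dependency within distance w: $f_{C_w}(q_i, q_j, D)$

$$P(Rel \mid Q, D) \stackrel{rank}{=} \sum_{q_i \in Q} \lambda_U(q_i \mid Q) \, f_U(q_i, D) + \sum_{q_i q_{i+1} \in Q} \lambda_B(q_i q_{i+1} \mid Q) \, f_B(q_i q_{i+1}, D) + \sum_{w \in W} \sum_{q_i, q_j \in Q} \lambda_{C_w}(q_i, q_j \mid Q) \, f_{C_w}(q_i, q_j, D)$$

$\lambda_B$ and $\lambda_{C_w}$ are the importance of a particular dependency between a term pair ($\lambda_U$ is fixed to 1).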
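The three kinds of units can be made concrete with a short sketch. It assumes that "within distance w" means two characters whose positions differ by less than w, and that query pairs are all unordered character pairs; the slides do not spell out either point, and the function names and example strings are hypothetical.

```python
from collections import Counter
from itertools import combinations

def query_dependencies(query):
    """Return the unigrams, ordered bigrams and unordered character pairs of a query."""
    chars = list(query)
    bigrams = [a + b for a, b in zip(chars, chars[1:])]
    pairs = [frozenset(p) for p in combinations(chars, 2)]
    return chars, bigrams, pairs

def cooccurrence_counts(text, w):
    """Count unordered character pairs whose positions differ by less than w.

    This window definition is an assumption made for the example.
    """
    counts = Counter()
    for i in range(len(text)):
        for j in range(i + 1, min(i + w, len(text))):
            counts[frozenset((text[i], text[j]))] += 1
    return counts

unigrams, bigrams, pairs = query_dependencies("京九铁路")
document = "京九铁路连接北京和九龙"  # made-up document text
counts_c2 = cooccurrence_counts(document, 2)
counts_c8 = cooccurrence_counts(document, 8)
print(bigrams, counts_c2[frozenset("京九")], counts_c8[frozenset("京九")])
```

A wider window picks up more co-occurrences of the same pair, which is why each window size gets its own feature and its own weight.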

The discriminative function is defined by the cross-entropy of the query language model and the document language model:

$$f_U(q_i, D) = P(q_i \mid Q_U) \log P(q_i \mid D_U)$$

$$f_B(q_i q_{i+1}, D) = P(q_i q_{i+1} \mid Q_B) \log P(q_i q_{i+1} \mid D_B)$$

$$f_{C_w}(q_i, q_j, D) = P(\{q_i, q_j\} \mid Q_{C_w}) \log P(\{q_i, q_j\} \mid D_{C_w})$$

We simply use maximum likelihood (ML) estimation for the query model, and use Dirichlet smoothing for the document language model (R is U, B or Cw):

$$P_{ml}(t_R \mid Q_R) = \frac{c(t_R; Q_R)}{|Q_R|}$$

$$P(t_R \mid D_R) = \frac{c(t_R; D_R) + \mu_R \, P(t_R \mid C_R)}{|D_R| + \mu_R}$$
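As a worked illustration of one feature value, the sketch below computes the ML query probability, the Dirichlet-smoothed document probability and their product with the log. The counts, the collection probabilities and the function names are invented for the example; only the formulas follow the text.

```python
import math
from collections import Counter

def p_ml(term, query_terms):
    """Maximum likelihood estimate: c(t; Q) / |Q|."""
    return query_terms.count(term) / len(query_terms)

def p_dirichlet(term, doc_counts, doc_len, p_coll, mu):
    """Dirichlet smoothing: (c(t; D) + mu * P(t|C)) / (|D| + mu)."""
    return (doc_counts[term] + mu * p_coll[term]) / (doc_len + mu)

def feature(term, query_terms, doc_counts, doc_len, p_coll, mu):
    """Cross-entropy style feature: P(t|Q) * log P(t|D)."""
    return p_ml(term, query_terms) * math.log(
        p_dirichlet(term, doc_counts, doc_len, p_coll, mu)
    )

query_terms = ["京", "九"]
doc_counts = Counter("京九铁路京九")       # toy document: 京 and 九 occur twice each
doc_len = sum(doc_counts.values())         # 6 characters
p_coll = {"京": 0.001, "九": 0.002}        # made-up collection probabilities

f_u = feature("京", query_terms, doc_counts, doc_len, p_coll, mu=1000)
print(f_u)
```

With these toy numbers the smoothed probability is (2 + 1) / (6 + 1000), so the feature equals 0.5 · log(3/1006).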

4. Parameter Estimation

Estimate: Dirichlet Prior (μ)


In the document language model estimation, if we use different window sizes W = {2, 4, 8}, we have the following priors: μU, μB, μC2, μC4, μC8.

Intuitively (and as confirmed by our preliminary experiments), a longer document expression (e.g. C8) leads to higher sparsity. This requires a larger μ.

The μ's are set roughly proportional to the document length in the expressions U, B, C2, C4, C8: μU=1000, μB=1000, μC2=1000, μC4=3000, μC8=7000.

Estimate: Dependency Strength (λs)

Learning process:

(1) For each bigram and co-occurrence (xi) in the training queries, use a coordinate ascent search algorithm to find its best weight: xi → λ*(xi)

(2) Extract a group of features for each xi, which gives the training data: {(xi, λ*(xi))}

(3) Train SVM models for B, C2, C4 and C8 respectively.

(4) For a new bigram or co-occurrence y in the query, we create the list of features and determine the weight using the SVM.
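Step (1) can be sketched as a simple grid-based coordinate ascent: one weight is varied at a time while the others are held fixed, and a change is kept only if the objective improves. The slides do not state the objective or the search grid; the toy objective below merely stands in for a retrieval metric measured on training queries, and all names are hypothetical.

```python
def coordinate_ascent(objective, names, grid, init=0.0, sweeps=3):
    """Optimize one weight at a time over a grid, holding the others fixed."""
    weights = {name: init for name in names}
    best = objective(weights)
    for _ in range(sweeps):
        improved = False
        for name in names:
            for value in grid:
                trial = {**weights, name: value}
                result = objective(trial)
                if result > best:  # keep a change only if it strictly improves
                    best, weights, improved = result, trial, True
        if not improved:
            break  # no coordinate could be improved: local optimum reached
    return weights, best

def toy_objective(w):
    """Made-up stand-in for retrieval effectiveness, peaking at 0.9 and 0.1."""
    return -((w["京九"] - 0.9) ** 2 + (w["房屋"] - 0.1) ** 2)

grid = [i / 10 for i in range(11)]  # candidate weights 0.0, 0.1, ..., 1.0
best_weights, best_value = coordinate_ascent(toy_objective, ["京九", "房屋"], grid)
print(best_weights)
```

In this toy setting the strong dependency ends up with a high weight and the weak one with a low weight; the resulting pairs (dependency, best weight) are what step (2) turns into regression training data.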


We use epsilon Support Vector Regression (ε-SVR) to train the SVM models.

We use the following features:

― Point-wise mutual information in an independent text collection: PMI_all(x)
― PMI in the current test collection: PMI_coll(x)
― A binary value according to the test: PMI_all(x) > Threshold?
― Binary test value: PMI_coll(x) > Threshold?
― idf(x) - idf(qi) - idf(qj)
― (idf(x) - idf(qi) - idf(qj)) / (idf(qi) + idf(qj))
― Does x appear in a Chinese Wikipedia title?
― The distance between qi and qj
― …
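A few of the listed features can be computed as in the sketch below. It assumes the usual definitions PMI = log(P(x,y) / (P(x)P(y))) with relative-frequency estimates and idf = log(N / df); the slides do not give the exact formulas or the threshold, and the statistics and key names are invented.

```python
import math

def pmi(count_xy, count_x, count_y, total):
    """Point-wise mutual information from raw counts (assumed definition)."""
    return math.log((count_xy / total) / ((count_x / total) * (count_y / total)))

def idf(doc_freq, num_docs):
    """Inverse document frequency, assumed here to be log(N / df)."""
    return math.log(num_docs / doc_freq)

def pair_features(stats, threshold=0.0):
    """Build part of the feature list for a pair x = (qi, qj)."""
    pmi_coll = pmi(stats["n_xy"], stats["n_x"], stats["n_y"], stats["n_total"])
    idf_x = idf(stats["df_xy"], stats["num_docs"])
    idf_i = idf(stats["df_x"], stats["num_docs"])
    idf_j = idf(stats["df_y"], stats["num_docs"])
    return {
        "pmi_coll": pmi_coll,
        "pmi_coll_above_threshold": 1.0 if pmi_coll > threshold else 0.0,
        "idf_diff": idf_x - idf_i - idf_j,
        "idf_diff_norm": (idf_x - idf_i - idf_j) / (idf_i + idf_j),
        "distance": float(stats["distance"]),
    }

# Made-up collection statistics for one character pair.
stats = {"n_xy": 50, "n_x": 200, "n_y": 100, "n_total": 100000,
         "df_xy": 10, "df_x": 100, "df_y": 50, "num_docs": 1000, "distance": 1}
features = pair_features(stats)
print(features)
```

Each dependency would get one such feature vector, paired with its best weight as the regression target.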

In our experiments, we use 10-fold cross-validation: in turn, 1/10 of the data is used as test data while the remaining 9/10 is used as training data.
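The rotation of folds can be sketched as follows. How the data is partitioned is not stated in the slides; this example simply deals the items round-robin into ten folds, and the function name is hypothetical.

```python
def k_fold_splits(items, k=10):
    """Yield (train, test) pairs; each fold serves as test data exactly once."""
    folds = [items[i::k] for i in range(k)]  # round-robin assignment (an assumption)
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

items = list(range(50))  # stand-ins for 50 training examples
splits = list(k_fold_splits(items, k=10))
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```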

5. Experiment and Discussion

Experimental setting

Coll.    #doc    Size (MB)  Avg. doc. length  #Queries       Avg. Q. length
TREC5    165 K   173        158               28 (Ch1-28)    4.7
TREC6    165 K   173        158               26 (Ch29-54)   4.7
TREC9    128 K   86         205               25 (Ch55-79)   3.7
NTCIR4   382 K   543        226               59 (001-060)   4.3
NTCIR5   901 K   1106       207               50 (001-050)   4.6
NTCIR6   901 K   1106       207               50 (003-110)   3.9

We convert all the characters into GB Simplified.

Chinese texts are segmented into words by ICTCLAS and the LDC segmentation program.

We use Indri to build the indexes of U, W, B, WU, BU, W+U, B+U.

The baselines (MAP) of traditional Chinese IR models

Coll.    U      B      BU     B+U    W      WU     W+U
TREC5    .3013  .2696  .3184  .3269  .2802  .3265  .3173
TREC6    .3601  .3610  .3875  .3878  .3881  .3983  .3998
TREC9    .2381  .2119  .2469  .2543  .1905  .2283  .2381
NTCIR4   .2371  .1995  .2243  .2489  .2237  .2396  .2469
NTCIR5   .3587  .3151  .3563  .3681  .3840  .3817  .3998
NTCIR6   .2695  .2448  .2931  .3064  .2739  .2863  .3012


U: unigram; B: bigram; W: words
BU: mixed bigram and unigram in a single index
B+U: the scores using B and U are interpolated
WU: mixed word and unigram in a single index
W+U: the scores using W and U are interpolated

The baselines of dependency models: MRF-SD and WSD

         MRF-SD                               Weighted-SD
Coll.    MAP    cf. U    cf. B+U  cf. W+U     MAP    cf. U    cf. SD
TREC5    .3271  +8.6%‡   +3.1%    +3.1%       .3279  +8.8%‡   +0.2%
TREC6    .3899  +8.3%‡   +0.6%    -2.5%       .3780  +5.0%    -3.1%
TREC9    .2576  +8.2%    +1.3%    +8.2%       .2732  +14.8%†  +6.0%
NTCIR4   .2490  +5.0%†   +0.0%    +0.8%       .2514  +6.0%‡   +1.0%
NTCIR5   .3846  +7.2%‡   +4.5%    -3.8%       .3909  +9.0%†   +1.6%
NTCIR6   .3066  +13.8%‡  +0.0%    +1.8%       .3088  +14.6%‡  +0.7%

†: t-test p < .05; ‡: t-test p < .01

The results of our VDM

         VDM (10-fold cross-validation)                          VDM (ideal)
Coll.    MAP    cf. U    cf. B+U  cf. W+U  cf. SD   cf. WSD      MAP
TREC5    .3501  +16.2%‡  +7.1%‡   +10.4%‡  +7.1%‡   +6.8%†       .4414
TREC6    .4159  +15.5%‡  +7.3%‡   +4.0%†   +6.7%‡   +10.0%‡      .5272
TREC9    .2713  +14.0%   +6.7%    +14.0%   +5.3%    -0.7%        .3896
NTCIR4   .2613  +10.2%‡  +5.0%‡   +5.8%‡   +4.9%‡   +3.9%‡       .3494
NTCIR5   .3949  +10.1%‡  +7.3%†   -1.2%    +2.7%    +1.0%        .5261
NTCIR6   .3142  +16.6%‡  +2.5%†   +4.3%†   +2.5%†   +1.7%        .4126

• Our model (VDM) outperforms all the baseline methods except in two cases (W+U on NTCIR5 and WSD on TREC9). Many of the improvements are statistically significant.

• With ideal parameters, the model outperforms the existing models by a large margin.
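The "cf." columns appear to report the relative change in MAP with respect to the named baseline. A small check under that assumption, using two values from the tables (the helper name is hypothetical):

```python
def relative_change(map_new, map_base):
    """Relative MAP change in percent: 100 * (new - base) / base."""
    return 100.0 * (map_new - map_base) / map_base

# VDM vs. unigram baseline on TREC5 (.3501 vs. .3013)
vdm_vs_u_trec5 = relative_change(0.3501, 0.3013)
# VDM vs. WSD on TREC9 (.2713 vs. .2732), one of the two cases where VDM is lower
vdm_vs_wsd_trec9 = relative_change(0.2713, 0.2732)

print(round(vdm_vs_u_trec5, 1), round(vdm_vs_wsd_trec9, 1))
```

Rounded to one decimal, these reproduce the +16.2% and -0.7% entries reported for VDM.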

Examples for Ideal Various Weights

[Figure: ideal weights of the bigram (bi) and co-occurrence (co2, co4, co8) dependencies between the terms of two example queries, “impact 1986 immigr. law” and “中东和平会议” (Middle East peace conference); the weights shown range from .01 to .9.]

6. Conclusion and Future Work

Conclusion

We propose a model to take into account the relationships between different types of index.

In our model, a pair of characters is used in the retrieval model according to its strength and usefulness for IR.

The assignment of variable weights to pairs of characters has not been investigated in previous studies.

Our experiments showed that the integration of term dependencies with variable weights can lead to higher effectiveness.

The model we propose in this paper points to an interesting direction for future research – the integration of dependencies according to their usefulness in IR.


Future Work

We have not exploited all the potential of the model. Several aspects could be further improved:

― It would be possible to extend dependencies from pairs of characters to more characters.
― A larger amount of training data (such as user click-throughs) could be used to learn the weights correctly.


Thanks

Questions?
