Modeling Variable Dependencies between Characters in Chinese Information Retrieval
Lixin Shi, Jian-Yun Nie
DIRO, University of Montreal
Outline
1. Motivation
2. Related Work
3. Variable Dependency Model
4. Parameter Estimation
5. Experiment and Discussion
6. Conclusion and Future Work
Motivation
Two approaches to index Chinese texts:
― Character n-grams (unigram and bigram)
― Segmented words
Traditional approaches assume independence among terms
― Bag-of-words models
Current approaches typically combine different models with fixed weights.
― e.g. $\lambda \cdot \text{unigram} + (1 - \lambda) \cdot \text{bigram}$
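A minimal sketch of the fixed-weight combination mentioned above, assuming the two models each return a retrieval score for the same query-document pair. The function name `interpolate` and the choice to put `lam` on the unigram side are illustrative assumptions, not taken from the slides.

```python
def interpolate(score_unigram: float, score_bigram: float, lam: float) -> float:
    """Fixed-weight mix of two scores: lam * unigram + (1 - lam) * bigram."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    # The same lam is applied to every query and every term pair,
    # which is exactly the limitation the variable-weight model targets.
    return lam * score_unigram + (1.0 - lam) * score_bigram
```

With a single global `lam`, a strongly dependent pair and a weakly dependent pair contribute in the same proportion.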
In reality, terms are often dependent, and term dependencies do not have equal importance in IR.
― Strong: “hot dog”, “black Monday”
   ― These dependencies should play an important role in IR
― Weak: “computer game”, “text printing”
   ― These dependencies should carry only a small weight
Dependencies are even more important to consider in Chinese IR
― Characters can be strongly dependent: 京 (capital, Beijing), 九 (nine) → 京九 (北京九龙, Beijing-Kowloon)
― Weak dependency: 房, 屋 → 房屋 (house)
We try to capture various dependencies in our model
◦ use SVM to determine their weights.
2. Related Work
Combining Different Indexes
• Previous studies often combine characters, bigrams and words
• In the LM approach, a general formulation is as follows:
$$Score(Q, D) = \sum_{R} \lambda_R \, Score_R(Q, D)$$
$$Score_R(Q, D) = \sum_{w \in V_R} P(w \mid Q) \log P(w \mid D)$$
where $V_R$ is the vocabulary of type R (R can be U, B, or W) and $\lambda_R$ is a fixed weight.
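The combination of indexes can be sketched as follows. This is an illustration under stated assumptions: the query model is a maximum-likelihood estimate, the document models are passed in as already-smoothed probability dictionaries, and the names `ml_model`, `score_r`, `combined_score` and the `floor` fallback for unseen terms are hypothetical.

```python
import math
from collections import Counter

def ml_model(terms):
    """Maximum-likelihood model over a list of index units."""
    counts = Counter(terms)
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

def score_r(query_terms, doc_model, floor=1e-10):
    """Score_R(Q, D) = sum over w of P(w|Q) * log P(w|D) for one index type R."""
    q_model = ml_model(query_terms)
    # `floor` is an assumed stand-in for smoothing of unseen terms.
    return sum(p * math.log(doc_model.get(t, floor)) for t, p in q_model.items())

def combined_score(query_by_type, doc_models, weights):
    """Score(Q, D) = sum over R of lambda_R * Score_R(Q, D) with fixed lambda_R."""
    return sum(weights[r] * score_r(query_by_type[r], doc_models[r]) for r in weights)
```

Here `weights` plays the role of the fixed λR values, for example `{"U": 0.5, "B": 0.5}`.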
Related Work in English
Combining the unigram model with bigram and biterm models
Markov Random Fields (MRF): an undirected graphical model that captures the dependencies of terms within the same clique (fully connected nodes)
― MRF-FD (Full Model): assumes that all terms are connected to each other. Leads to the problem of complexity for large cliques.
― MRF-SD (Sequential Model): considers that only adjacent terms are connected.
◦ Uses fixed weights for combination: λT (unigram), λO (ordered bigram), λU (unordered bigram)
Weighted SD Model (WSD): a recently extended MRF that allows different weights λO and λU depending on individual term pairs
◦ Limitations
― Considers adjacent terms only
― Ordered and unordered term pairs use the same weight (i.e. λO = λU)
3. A New Variable Dependency Model
Discriminative Models
The model is defined within the framework of discriminative models.
These allow us to selectively consider dependencies between more distant characters, without having to increase the complexity to account for less useful dependencies.
The discriminative function can be a posterior probability or simply a confidence score.
A typical discriminative model:
$$P(Rel \mid Q, D) = \frac{1}{Z} \exp\left( \sum_{i=1}^{n} \lambda_i f_i(Q, D) \right)$$
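A small sketch of the log-linear form above. It assumes, for illustration only, that the normalizer Z is the same for every document being compared, so that ranking by the weighted feature sum gives the same order as ranking by the posterior. The names `linear_score` and `posterior` are hypothetical.

```python
import math

def linear_score(lambdas, features):
    """Weighted feature sum: sum_i lambda_i * f_i(Q, D)."""
    return sum(l * f for l, f in zip(lambdas, features))

def posterior(lambdas, features, z):
    """(1 / Z) * exp(weighted feature sum); z is assumed document-independent."""
    return math.exp(linear_score(lambdas, features)) / z
```

Because `exp` is monotonic, the document with the larger `linear_score` also has the larger `posterior` whenever `z` is shared, so in practice only the sum needs to be computed.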
Our Model
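We integrate 3 types of features:
― Unigram: $f_U(q_i, D)$
― Ordered bigrams: $f_B(q_i q_{i+1}, D)$
― Unordered co-occurrence dependency within distance w: $f_{C_w}(q_i, q_j, D)$
$$P(Rel \mid Q, D) \overset{rank}{=} \sum_{q_i \in Q} \lambda_U(q_i \mid Q) \, f_U(q_i, D) + \sum_{q_i q_{i+1} \in Q} \lambda_B(q_i, q_{i+1} \mid Q) \, f_B(q_i q_{i+1}, D) + \sum_{w \in W} \sum_{q_i, q_j \in Q} \lambda_{C_w}(q_i, q_j \mid Q) \, f_{C_w}(q_i, q_j, D)$$
λB and λCw give the importance of a particular dependency between a term pair (λU is fixed to 1).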
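The three kinds of units the model sums over can be enumerated as in the sketch below. One point is an assumption rather than something the slides state: "within distance w" is read here as two positions i < j with j - i < w, so that both characters fit in a window of w characters. The function name `query_dependencies` is hypothetical.

```python
def query_dependencies(chars, windows=(2, 4, 8)):
    """Enumerate the unigram, ordered-bigram and windowed co-occurrence units of a query."""
    unigrams = list(chars)
    # Ordered bigrams: adjacent characters q_i q_{i+1}.
    bigrams = [(chars[i], chars[i + 1]) for i in range(len(chars) - 1)]
    # Unordered pairs {q_i, q_j}. ASSUMPTION: "within distance w" means j - i < w.
    cooc = {
        w: [frozenset((chars[i], chars[j]))
            for i in range(len(chars))
            for j in range(i + 1, min(i + w, len(chars)))]
        for w in windows
    }
    return unigrams, bigrams, cooc
```

Each bigram and each co-occurrence pair returned here would receive its own weight λ, whereas a sequential model with fixed weights would share one weight across all of them.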
The discriminative function is defined by the cross-entropy of the query language model and the document language model:
$$f_U(q_i, D) = P_U(q_i \mid Q) \log P_U(q_i \mid D)$$
$$f_B(q_i q_{i+1}, D) = P_B(q_i q_{i+1} \mid Q) \log P_B(q_i q_{i+1} \mid D)$$
$$f_{C_w}(q_i, q_j, D) = P_{C_w}(\{q_i, q_j\} \mid Q) \log P_{C_w}(\{q_i, q_j\} \mid D)$$
We simply use maximum likelihood (ML) estimation for the query model, and Dirichlet smoothing for the document language model (R is U, B or Cw):
$$P_{ml}(t_R \mid Q_R) = \frac{c(t_R; Q_R)}{|Q_R|}$$
$$P(t_R \mid D_R) = \frac{c(t_R; D_R) + \mu_R \, P(t_R \mid C_R)}{|D_R| + \mu_R}$$
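A minimal sketch of one feature value under these estimates, assuming the query and the document are given as lists of units of a single type R and the collection probability P(t|C) is supplied by the caller. The names `p_ml`, `p_dirichlet` and `feature` are illustrative.

```python
import math
from collections import Counter

def p_ml(term, query_terms):
    """Maximum-likelihood query model: c(t; Q) / |Q|."""
    return Counter(query_terms)[term] / len(query_terms)

def p_dirichlet(term, doc_terms, coll_prob, mu):
    """Dirichlet-smoothed document model: (c(t; D) + mu * P(t|C)) / (|D| + mu)."""
    return (Counter(doc_terms)[term] + mu * coll_prob) / (len(doc_terms) + mu)

def feature(term, query_terms, doc_terms, coll_prob, mu):
    """f_R = P(t|Q) * log P(t|D) for one unit t of type R."""
    return p_ml(term, query_terms) * math.log(p_dirichlet(term, doc_terms, coll_prob, mu))
```

Smoothing keeps the probability of a unit that is absent from the document above zero, so the logarithm stays defined.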
4. Parameter Estimation
Estimate: Dirichlet Prior (μ)
In the document language model estimation, if we use different window sizes W = {2, 4, 8}, we have the following priors: μU, μB, μC2, μC4, μC8.
Intuitively (and as confirmed by our preliminary experiments), a longer document expression (e.g. C8) leads to higher sparsity, which requires a larger μ.
The μ's are set roughly proportional to the document length in the expressions U, B, C2, C4, C8: μU = 1000, μB = 1000, μC2 = 1000, μC4 = 3000, μC8 = 7000.
Estimate: Dependency Strength (λs)
Learning process:
(1) For each bigram and co-occurrence (xi) in the training queries, use a coordinate ascent search algorithm to find its best weight: xi → λ*(xi)
(2) Extract a group of features (a feature vector, also written xi) for each xi, which gives the training data: {(xi, λ*(xi))}
(3) Train SVM models for B, C2, C4 and C8 respectively.
(4) For a new bigram or co-occurrence y in a query, create its list of features and determine its weight using the SVM.
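Step (1) can be illustrated with a generic coordinate ascent over a grid of candidate weights. This is a sketch under assumptions: the slides do not give the search grid, the starting point or the stopping rule, and `objective` stands for whatever retrieval effectiveness measure is computed on the training queries. All names here are hypothetical.

```python
def coordinate_ascent(names, objective, candidates, init=0.0, max_sweeps=10):
    """Optimize one weight at a time, keeping the others fixed."""
    weights = {n: init for n in names}
    best = objective(weights)
    for _ in range(max_sweeps):
        improved = False
        for n in names:
            for c in candidates:
                trial = dict(weights)
                trial[n] = c
                value = objective(trial)
                # Accept a candidate only on strict improvement.
                if value > best:
                    best, weights, improved = value, trial, True
        if not improved:
            break  # a full sweep changed nothing: local optimum on the grid
    return weights, best
```

The weights found this way would serve as regression targets λ*(xi) in step (2); the regressor of step (3) is not shown.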
We use epsilon Support Vector Machine Regression (ε-SVR) for SVM training/learning.
We use the following features:
― Point-wise mutual information in an independent text collection: PMI_all(x)
― PMI in the current test collection: PMI_coll(x)
― A binary value according to the test: PMI_all(x) > Threshold?
― Binary test value: PMI_coll(x) > Threshold?
― idf(x) - idf(qi) - idf(qj)
― (idf(x) - idf(qi) - idf(qj)) / (idf(qi) + idf(qj))
― Does x appear in a Chinese Wikipedia title?
― The distance between qi and qj
― …
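The listed features can be assembled into a vector roughly as follows. The exact PMI and idf formulas are not given in the slides, so the standard definitions log(P(x,y) / (P(x)P(y))) and log(N / df) are assumed here, and the threshold value, the feature order and all function names are hypothetical.

```python
import math

def pmi(count_xy, count_x, count_y, total):
    """Point-wise mutual information from raw counts (assumed standard form)."""
    p_xy = count_xy / total
    return math.log(p_xy / ((count_x / total) * (count_y / total)))

def idf(doc_freq, n_docs):
    """Inverse document frequency, assumed to be log(N / df)."""
    return math.log(n_docs / doc_freq)

def pair_features(pmi_all, pmi_coll, idf_x, idf_qi, idf_qj,
                  in_wiki_title, distance, threshold=0.0):
    """Feature vector for a pair x = (qi, qj), in the order of the list above."""
    idf_gap = idf_x - idf_qi - idf_qj
    return [
        pmi_all, pmi_coll,
        float(pmi_all > threshold), float(pmi_coll > threshold),
        idf_gap, idf_gap / (idf_qi + idf_qj),
        float(in_wiki_title), float(distance),
    ]
```

Such a vector, paired with the best weight found for the same pair, would form one training example for the regression model.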
In our experiments, we use 10-fold cross-validation: in turn, 1/10 of the data is used as test data while the remaining 9/10 is used as training data.
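The rotation of folds can be sketched as below. How the data were actually assigned to folds is not stated, so the round-robin assignment used here is an assumption, and `k_fold_splits` is a hypothetical helper name.

```python
def k_fold_splits(items, k=10):
    """Yield (train, test) pairs; each fold serves as test data exactly once."""
    # ASSUMPTION: round-robin assignment of items to folds.
    folds = [items[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test
```

Every item is scored by a model that never saw it during training, which is what makes the cross-validated figures comparable to a held-out evaluation.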
5. Experiment and Discussion
Experimental setting
Coll.    #doc    Size (MB)   Avg. doc. length   #Queries        Avg. Q. length
TREC5    165 K   173         158                28 (Ch1-28)     4.7
TREC6    165 K   173         158                26 (Ch29-54)    4.7
TREC9    128 K   86          205                25 (Ch55-79)    3.7
NTCIR4   382 K   543         226                59 (001-060)    4.3
NTCIR5   901 K   1106        207                50 (001-050)    4.6
NTCIR6   901 K   1106        207                50 (003-110)    3.9
(TREC5 and TREC6 share one document collection, as do NTCIR5 and NTCIR6.)
We convert all the characters into GB Simplified.
Chinese texts are segmented into words by ICTCLAS and the LDC segmentation program.
We use Indri to build the indexes of U, W, B, WU, BU, W+U, B+U.
The baselines (MAP) of traditional Chinese IR models
U B BU B+U W WU W+U
TREC5 .3013 .2696 .3184 .3269 .2802 .3265 .3173
TREC6 .3601 .3610 .3875 .3878 .3881 .3983 .3998
TREC9 .2381 .2119 .2469 .2543 .1905 .2283 .2381
NTCIR4 .2371 .1995 .2243 .2489 .2237 .2396 .2469
NTCIR5 .3587 .3151 .3563 .3681 .3840 .3817 .3998
NTCIR6 .2695 .2448 .2931 .3064 .2739 .2863 .3012
U: unigram; B: bigram; W: words
BU: mixed bigram and unigram in a single index
B+U: the scores using B and U are interpolated
WU: mixed word and unigram in a single index
W+U: the scores using W and U are interpolated
The baselines of dependency models: MRF-SD and WSD
          MRF-SD                                   Weighted-SD
          MAP     cf. U      cf. B+U   cf. W+U     MAP     cf. U      cf. SD
TREC5     .3271   +8.6%‡     +3.1%     +3.1%       .3279   +8.8%‡     +0.2%
TREC6     .3899   +8.3%‡     +0.6%     -2.5%       .3780   +5.0%      -3.1%
TREC9     .2576   +8.2%      +1.3%     +8.2%       .2732   +14.8%†    +6.0%
NTCIR4    .2490   +5.0%†     +0.0%     +0.8%       .2514   +6.0%‡     +1.0%
NTCIR5    .3846   +7.2%‡     +4.5%     -3.8%       .3909   +9.0%†     +1.6%
NTCIR6    .3066   +13.8%‡    +0.0%     +1.8%       .3088   +14.6%‡    +0.7%
†: t-test p < .05; ‡: t-test p < .01
The result of our VDM
          VDM (10-fold cross-validation)                                       VDM (ideal)
          MAP     cf. U      cf. B+U    cf. W+U    cf. SD     cf. WSD          MAP
TREC5     .3501   +16.2%‡    +7.1%‡     +10.4%‡    +7.1%‡     +6.8%†           .4414
TREC6     .4159   +15.5%‡    +7.3%‡     +4.0%†     +6.7%‡     +10.0%‡          .5272
TREC9     .2713   +14.0%     +6.7%      +14.0%     +5.3%      -0.7%            .3896
NTCIR4    .2613   +10.2%‡    +5.0%‡     +5.8%‡     +4.9%‡     +3.9%‡           .3494
NTCIR5    .3949   +10.1%‡    +7.3%†     -1.2%      +2.7%      +1.0%            .5261
NTCIR6    .3142   +16.6%‡    +2.5%†     +4.3%†     +2.5%†     +1.7%            .4126
• Our model (VDM) outperforms all the baseline methods except in two cases. Many of the improvements are statistically significant.
• Ideal parameters largely outperform the existing models.
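The "cf." columns read as relative changes in MAP with respect to a baseline. A short sketch of that arithmetic, assuming the usual definition (new - base) / base; the function name `relative_gain` is illustrative.

```python
def relative_gain(new_map: float, base_map: float) -> float:
    """Relative change of MAP over a baseline, in percent."""
    return 100.0 * (new_map - base_map) / base_map
```

For TREC5, for instance, `relative_gain(0.3501, 0.3013)` rounds to 16.2, matching the cf. U entry for VDM against the unigram baseline of .3013.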
Examples for Ideal Various Weights
[Figure: ideal weights of the bigram (bi) and co-occurrence (co2, co4, co8) dependencies for two example queries, one with the terms "impact", "1986", "immigr.", "law" and one with the characters 中, 东, 和, 平, 会, 议; the weights shown range from .01 to .9]
6. Conclusion and Future Work
Conclusion
We propose a model to take into account the relationships between different types of index.
In our model, a pair of characters is used in the retrieval model according to its strength and usefulness for IR.
The assignment of variable weights to pairs of characters has not been investigated in previous studies.
Our experiments showed that the integration of term dependencies with variable weights can lead to higher effectiveness.
The model we propose in this paper points to an interesting direction for future research: the integration of dependencies according to their usefulness in IR.
Future Work
We have not exploited all the potential of the model. Several aspects could be further improved:
― It would be possible to extend dependencies of pairs of characters to more characters.
― Using a larger amount of training data (such as user click-throughs) to correctly learn the weights.
Thanks
Questions?