Surveys of Some Critical Issues in Chinese Indexing
Chinese Document Indexing and Word Segmentation
• Speaker: Reuy-Lung Hsiao
• Date: Wed, Dec. 22
Roadmap
1. An overview of Web Information Retrieval system architecture
2. Automatic indexing overview
3. Questions of Chinese document indexing
4. Typical approaches to index Chinese document sets
5. Chinese word segmentation mechanisms
6. Segmentation algorithms
7. Discussion and Conclusion
8. References
System Overview
[Diagram: Web IR system architecture — Information Discovery feeds Indexing, which builds the Index Database; Query Formulation produces a Request, Similarity Measurement (Ranking) matches it against the index, and the Response is the Result Document Set. Chinese document indexing is the focus of this survey.]
Automatic Indexing Overview
1. An automatic indexing mechanism extracts the features (terms or keywords) of a given document.
2. The indexing process may contain the following steps:
   (1) Morphological & lexical analysis: stemming -> stop list -> weighting -> thesaurus construction
   (2) Syntactic & semantic analysis: part-of-speech tagging -> information extraction -> concept extraction
3. Weighting plays an important role in retrieval effectiveness.
   (1) Typical term-weighting mechanism: TFxIDF.
   (2) Typical effectiveness measurements: recall, precision.
Automatic Indexing Overview
4. TFxIDF

   w_ij = tf_ij × log(N / df_j)

   where tf_ij is the frequency of term j in document i, df_j is the number of documents containing term j, and N is the total number of documents.

5. Recall/Precision

   Recall = # retrieved relevant documents / # relevant documents
   Precision = # retrieved relevant documents / # retrieved documents

   [Diagram: the relevance line and the retrieval line divide the collection into regions A (retrieved but not relevant), B (retrieved and relevant), C (relevant but not retrieved), and D (neither); Recall = B / (B+C), Precision = B / (A+B).]
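As a concrete illustration of the two formulas above, a minimal sketch in Python (the toy counts and document IDs are invented for illustration):

```python
import math

def tfidf(tf, df, n_docs):
    """w_ij = tf_ij * log(N / df_j): term frequency scaled by rarity."""
    return tf * math.log(n_docs / df)

def recall_precision(retrieved, relevant):
    """Recall = |retrieved ∩ relevant| / |relevant|;
    Precision = |retrieved ∩ relevant| / |retrieved|."""
    hit = len(retrieved & relevant)
    return hit / len(relevant), hit / len(retrieved)

# Toy example: a term occurs 3 times in a document and in 10 of 1000 docs.
w = tfidf(3, 10, 1000)

# Toy evaluation: 4 documents retrieved, 3 are relevant in the collection.
retrieved = {"d1", "d2", "d3", "d4"}
relevant = {"d2", "d3", "d5"}
r, p = recall_precision(retrieved, relevant)  # r = 2/3, p = 2/4
```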
Questions of Chinese Document Indexing
1. Words, rather than characters, should be the smallest indexing unit.
   • More specific to the concepts
   • Less index space required
2. A comprehensive lexicon is needed.
3. Chinese text has no delimiters to mark word boundaries.
   For example, English words have spaces and punctuation as separators, but 中文句子沒有明顯的分隔符號 ("Chinese sentences have no obvious separators").
Approaches to indexing Chinese Text
1. N-gram indexing
   • Typically uses N = 1, 2, 3
   • Produces large index files
2. Statistical indexing
   • Typically uses mutual information for word correlation
3. Word-based indexing
   • Rule-based approach
   • Statistical approach
   • Hybrid approach
Approaches to indexing Chinese Text (N-gram Indexing)
• N-gram indexing terms produced from the same text string:

  sentence:  C1C2C3C4C5C6
  unigrams:  C1, C2, C3, C4, C5, C6
  bigrams:   C1C2, C2C3, C3C4, C4C5, C5C6
  trigrams:  C1C2C3, C2C3C4, C3C4C5, C4C5C6
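The n-gram expansion shown above can be sketched as a one-line generator (a minimal illustration, not from the source):

```python
def ngrams(text, n):
    """All overlapping character n-grams of a string: the indexing terms."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(ngrams("中文句子", 2))  # ['中文', '文句', '句子']
```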
• N-gram index size for the TREC-5 Chinese collection:

  n-gram     # distinct n-grams   # of n-grams
  unigram             6,236         64,611,662
  bigram          1,393,488         54,362,319
  trigram         8,119,574         49,886,331
Approaches to indexing Chinese Text (Statistical Indexing)
• Mutual information I(x,y) between two events x and y is defined as

  I(x,y) = log2( P(x,y) / (P(x)P(y)) )

• If two events occur independently, P(x,y) will be close to P(x)P(y), so I(x,y) will be close to zero.
• If two events are strongly related, P(x,y) will be much larger than P(x)P(y), so I(x,y) will be large.
• Probabilities are derived by statistical counting: with f(·) the corpus frequencies and N the corpus size, P(C1) = f(C1)/N and P(C1,C2) = P(C1)P(C2|C1) = f(C1C2)/N, so

  I(C1,C2) = log2( N · f(C1C2) / (f(C1) f(C2)) )
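The frequency-based form of the formula can be sketched directly (the toy counts below are an assumption, chosen so the result is exact):

```python
import math

def mutual_information(f_c1, f_c2, f_c1c2, n):
    """I(C1,C2) = log2( N * f(C1C2) / (f(C1) * f(C2)) ),
    with probabilities estimated from corpus frequencies."""
    return math.log2(n * f_c1c2 / (f_c1 * f_c2))

# Toy corpus of N=16 characters: C1 occurs 4 times, C2 occurs 4 times,
# and the pair C1C2 occurs 2 times -- twice the independent expectation.
print(mutual_information(4, 4, 2, 16))  # 1.0
```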
Approaches to indexing Chinese Text (Statistical Indexing)
• Statistical Indexing Algorithm
1. Compute the mutual-information values of all adjacent bigrams.
2. Treat the bigram with the largest mutual-information value as a word and remove it from the text.
3. Repeat step 2 on each remaining short phrase until every phrase consists of one or two characters.

• The following statistics are based on text collected from the China Times on 12/19/99, 12/20/99, and 12/21/99.
• 621,079 characters in total; 3,827 distinct characters per day on average.
• Comparison among the above indexing methods (result).
Approaches to indexing Chinese Text (Statistical Indexing)
Sentence: 連戰新的競選宣言

bigram    f(C1)    f(C2)   f(C1C2)   I(C1,C2)
連戰        543      517        76       5.12
戰新        517     1498         0      -7.13
新的       1498    16187        80       0.72
的競      16187      223        34       1.77
競選        223     1028        61       5.11
選宣       1028      259         2       1.54
宣言        259      305         8       4.14

Step   action        phrases
1      remove 連戰    □□新的競選宣言
2      remove 競選    □□新的□□宣言
3      remove 宣言    □□新的□□□□
4      remove 新的
other example
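The three-step greedy procedure described earlier (extract the adjacent bigram with the highest mutual information, then recurse on the remaining phrases) can be sketched as follows. The frequency table reuses the 連戰新的競選宣言 example; treating zero pair counts as -inf (rather than the slide's smoothed value) is a simplification:

```python
import math

def mi(bigram, freq, n):
    """Mutual information of a character pair from corpus counts.
    Zero pair counts get -inf here; the slide smooths them instead."""
    f12 = freq.get(bigram, 0)
    if f12 == 0:
        return float("-inf")
    return math.log2(n * f12 / (freq[bigram[0]] * freq[bigram[1]]))

def segment(text, freq, n):
    """Greedy MI segmentation: repeatedly extract the adjacent bigram
    with the highest MI until every remaining phrase has <= 2 chars."""
    words, phrases = [], [text]
    while phrases:
        phrase = phrases.pop()
        if len(phrase) <= 2:
            if phrase:
                words.append(phrase)
            continue
        pairs = [phrase[i:i + 2] for i in range(len(phrase) - 1)]
        best = max(pairs, key=lambda b: mi(b, freq, n))
        i = phrase.index(best)
        words.append(best)
        phrases += [phrase[:i], phrase[i + 2:]]
    return words

# Counts from the slide's China Times example.
freq = {"連": 543, "戰": 517, "新": 1498, "的": 16187, "競": 223,
        "選": 1028, "宣": 259, "言": 305,
        "連戰": 76, "戰新": 0, "新的": 80, "的競": 34,
        "競選": 61, "選宣": 2, "宣言": 8}
print(segment("連戰新的競選宣言", freq, 621079))
```

The extracted words match the slide's steps (連戰, 競選, 宣言, 新的), though this sketch emits them in extraction order rather than sentence order.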
Approaches to indexing Chinese Text (Word-based Indexing)
1. Rule-based approach
   • Uses a dictionary (lexicon) to match words.
   • Concept: a correct segmentation result should consist of legitimate words. For example, 中國文學 can be segmented as:
     1. 中國 文學
     2. 中國 文 學
     3. 中 國文 學
     4. 中 國 文學
     5. 中 國 文 學
     We choose (1) as the result.
   • Out-of-vocabulary problem.
2. Statistical approach
   • Relies on statistical information such as word and character (co-)occurrence frequencies in the training data.
   • Concept: given a sentence, the best segmentation is the sequence of potential words Si for which the product of the word probabilities is the highest.
   • Supervised/unsupervised learning.
   • Requires large amounts of data for accuracy.
   • Sparse-data problem.
Approaches to indexing Chinese Text (Word-based Indexing)
∏i P(Si)
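A common concrete instance of the rule-based (dictionary-matching) approach is forward maximum matching; a minimal sketch, with a toy lexicon as an assumption:

```python
def max_match(text, lexicon, max_len=4):
    """Forward maximum matching: at each position, take the longest
    dictionary word starting there; fall back to a single character."""
    words, i = [], 0
    while i < len(text):
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon or j == i + 1:
                words.append(text[i:j])
                i = j
                break
    return words

lexicon = {"中國", "文學", "國文"}  # toy dictionary
print(max_match("中國文學", lexicon))  # ['中國', '文學']
```

Note that greedy left-to-right matching picks 中國 before it can consider 國文, which is exactly why segmentation (1) of 中國文學 wins here.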
Approaches to indexing Chinese Text (Segmentation Algo.)
• Hybrid Segmentation Algorithm by Jian-Yun Nie, Martin Brisebois, SIGIR '96
• Uses lexicon and statistical information to segment words, with morphological heuristic rules to augment lexicon coverage. (Note: supervised learning.)
• Terminology:
  • Background knowledge: words contained in the dictionary, with a default probability (p)
  • Foreground knowledge: statistical information
  • Heuristic rules: two kinds of rules are included:
    • Nominal pre-determiner structures such as 這一年、一百本、每一天
    • Affix structures such as 小朋友、大眾化
Approaches to indexing Chinese Text (Segmentation Algo.)
• Algorithm:
  • Combination of both kinds of knowledge: if statistical information is available, use it; otherwise the background knowledge is taken into account.
  • Each character in the input string is associated with all the candidate words starting at that character, together with their probabilities.
  • The candidate words are combined to cover the input string, and the word sequence with the highest probability is chosen as the result.
• Example: 大會決議和議程項目

  characters:      大       會       決        議       和       議       程       項       目
  P(character):  (0.016)  (0.029)  (0.00108) (0.0005) (0.945)  (0.0005) (0.0005) (0.0005) (0.0024)

  candidate words:  大會    決議    議和    和議    議程    項目
  P(word):         (1.0)   (0.956) (0.001) (0.001) (1.0)   (0.936)

  Result: 大會 | 決議 | 和 | 議程 | 項目
Approaches to indexing Chinese Text (Segmentation Algo.)
• Unsupervised Segmentation Algorithm by Xiaoqiang Luo, Salim Roukos, ACL '96
• A pure statistical learning model that uses no dictionary. It divides the training set into two parts, randomly segments part one, and segments part two using part one.
• Uses the previously constructed language model for iteration.
• Uses a Viterbi-like algorithm to build the LM.
• Concept:

  Let a sentence S = C1C2...Cn-1Cn, where Ci (1≦i≦n) is a Chinese character. To segment the sentence into words is to group these characters into words, i.e.

    S = C1C2...Cn-1Cn
      = (C1...Cx1)(Cx1+1...Cx2)...(Cxm-1+1...Cxm)
      = W1W2...Wm

  where xk is the index of the last character of the kth word Wk, i.e. Wk = Cxk-1+1...Cxk (k=1..m), with x0=0 and xm=n.

• A segmentation of the sentence S can be uniquely represented by the integer sequence x1,...,xm, so we denote all possible segmentations by

    G(S) = { (x1,...,xm) | 1 ≦ x1 < ... < xm = n, m ≦ n }

  and assign a score to a segmentation g(S) = (x1,...,xm) ∈ G(S) by
Approaches to indexing Chinese Text (Segmentation Algo.)
  L(g(S)) = log Pg(W1...Wm) = Σ_{i=1}^{m} log Pg(Wi | hi)

  where Wj = Cxj-1+1...Cxj (j=1..m) and hi is the history words W1...Wi-1. Here we adopt a trigram model, with hi = Wi-2Wi-1.

• Among all possible segmentations, we pick the one g* with the highest score as our result. That is,

  g* = arg max_{g∈G(S)} L(g(S)) = arg max_{g∈G(S)} log Pg(W1...Wm)

• Let L(k) be the maximum accumulated score over the first k characters. L(k) is defined for k=1..n, with L(1)=0 and L(g*) = L(n).
Approaches to indexing Chinese Text (Segmentation Algo.)
• Given { L(i) | 1≦i≦k-1 }, L(k) can be computed recursively as follows:

    L(k) = max_{1≦i≦k-1} [ L(i) + log P(Ci+1...Ck | hi) ]
    p(k) = arg max_{1≦i≦k-1} [ L(i) + log P(Ci+1...Ck | hi) ]

  so that Cp(k)+1...Ck comprises the last word of the optimal segmentation of the first k characters.
• For example, a six-character sentence:

    chars:  C1  C2  C3  C4  C5  C6
    k:       1   2   3   4   5   6
    p(k):    0   1   1   3   3   4

  Backtracking from p(6)=4 gives the words C5C6, C4, C2C3, C1, so the optimal segmentation of the sentence is (C1)(C2C3)(C4)(C5C6).
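The L(k)/p(k) recursion can be sketched as follows, with a unigram word model standing in for the trigram history hi, and an invented toy vocabulary (indices shifted so L[0] = 0 is the empty prefix):

```python
import math

def viterbi_segment(text, word_logprob):
    """Viterbi-like DP: L[k] is the best score over segmentations of the
    first k characters; p[k] marks where the last word begins."""
    n = len(text)
    L = [0.0] + [float("-inf")] * n   # L[0] = 0: empty prefix
    p = [0] * (n + 1)
    for k in range(1, n + 1):
        for i in range(k):
            w = text[i:k]
            if w in word_logprob and L[i] + word_logprob[w] > L[k]:
                L[k] = L[i] + word_logprob[w]
                p[k] = i
    # Backtrack through p to recover the word sequence.
    words, k = [], n
    while k > 0:
        words.append(text[p[k]:k])
        k = p[k]
    return words[::-1]

# Toy model over abstract characters a..f (cf. the (C1)(C2C3)(C4)(C5C6) example).
lp = {w: math.log(q) for w, q in
      {"a": 0.3, "bc": 0.2, "b": 0.05, "c": 0.05,
       "d": 0.3, "ef": 0.2, "e": 0.05, "f": 0.05}.items()}
print(viterbi_segment("abcdef", lp))  # ['a', 'bc', 'd', 'ef']
```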
Discussion and Conclusion
1. Since most Chinese words consist of two characters, bigram/statistical indexing outperforms the other methods, even the dictionary-based method. (According to New Advances in Computers and Natural Language Processing in China, Liu, Information Science (for China), '87: 5% of words are unigrams, 75% are bigrams, 14% are trigrams, and 6% are words of four or more characters.)
2. Character-based indexing is not suited for Chinese text retrieval, for the reasons below:
   • Character-based approaches lead to a great deal of incorrect matching between queries and documents, due to the quite free combination of characters.
Discussion and Conclusion
   • A complex concept would always have to be expressed by the same fixed character string in both the documents and the query.
   • In character-based approaches, every character is dealt with in the same way.
   • Character-based approaches do not allow us to easily incorporate linguistic knowledge into the searching process.
3. Word-based indexing is the first step toward concept-based indexing/retrieval, to avoid another information explosion.
Reference
1. A Statistical Method for Finding Word Boundaries in Chinese Text – Richard Sproat and Chilin Shih, CPOCOL '90
2. On Chinese Text Retrieval – Jian-Yun Nie, Martin Brisebois, SIGIR '96
3. An Iterative Algorithm to Build Chinese Language Models – Xiaoqiang Luo, Salim Roukos, ACL '96
4. Chinese Text Retrieval Without Using a Dictionary – Aitao Chen, Jianzhang He, SIGIR '97
5. A Tagging-Based First-Order Markov Model Approach to Automatic Word Identification for Chinese Sentences – T.B.Y. Lai, M.S. Sun, COLING '98
6. Chinese Indexing Using Mutual Information – Christopher C., Asian Digital Library Workshop '98
7. A New Statistical Formula for Chinese Text Segmentation Incorporating Contextual Information – Yubin Dai, Teck Ee Loh, SIGIR '99
8. Discovering Chinese Words from Unsegmented Text – Xianping Ge, Wanda Pratt, SIGIR '99
#   Index file   Indexing terms      Segmentation/indexing method    Dictionary, stop-list
1   Unigram      Unigrams            Unigram                         None
2   Bigram       Bigrams             Bigram                          Stop-list only
3   Trigram      Trigrams            Trigram                         Stop-list only
4   Mi           Bigrams, unigrams   Mutual information              Stop-list only
5   Max(f)       Words, phrases      Maximum matching (forward)      Both
6   Max(b)       Words, phrases      Maximum matching (backward)     Both
7   Min(f)       Words, phrases      Minimum matching (forward)      Both
8   Min(b)       Words, phrases      Minimum matching (backward)     Both
recall unigram bigram trigram mi Max(f) Max(b) Min(f) Min(b)
0.00 0.7751 0.7504 0.6962 0.7696 0.8000 0.7966 0.7404 0.7265
0.10 0.5609 0.6241 0.5006 0.6500 0.6465 0.6414 0.5543 0.5611
0.20 0.4076 0.5243 0.3600 0.5355 0.5283 0.5028 0.4336 0.4432
0.30 0.3400 0.4778 0.2932 0.4705 0.4308 0.4518 0.3595 0.3734
0.40 0.2904 0.4375 0.2546 0.4324 0.3841 0.4085 0.3049 0.3245
0.50 0.2486 0.3864 0.2153 0.3872 0.3455 0.3671 0.2569 0.2903
0.60 0.2050 0.3295 0.1815 0.3346 0.2947 0.3131 0.2216 0.2351
0.70 0.1576 0.2749 0.1586 0.2843 0.2439 0.2678 0.1657 0.1912
0.80 0.0982 0.2173 0.1142 0.2353 0.1891 0.2017 0.1221 0.1217
0.90 0.0300 0.1241 0.0581 0.1378 0.1051 0.1105 0.0819 0.0778
1.00 0.0031 0.0108 0.0091 0.0208 0.0282 0.0341 0.0197 0.0118
Average precision
       0.2609  0.3677  0.2405  0.3744  0.3558  0.3465  0.2738  0.2862
vs. Max(f) baseline
      -26.67%  +3.34% -32.40%  +5.23% baseline -2.61% -23.04% -19.56%
[Figure: recall-precision curves (recall 0–1.0 on the x-axis, precision 0–0.7 on the y-axis) comparing the Dictionary, Statistical, and Hybrid segmentation methods.]
• Corpus
  • 1,270 Kbytes
  • Training set: 1,247
  • Test set: 272
  • 90 words and 160 characters per document on average
  • Segmentation accuracy is around 91%
• Uses a stop list such as 的、並、除非、此外…
Example: 宋楚瑜興票案愈演愈烈 (849967/3998)

bigram    f(C1)   f(C2)   f(C1C2)   I(C1,C2)
宋楚       1103     800      665       6.15
楚瑜        800     673      665       6.64
瑜興        673     498        1       0.62
興票        498     687      191       5.85
票案        687    1061       66       4.03
案愈       1061     107        1       1.70
愈演        107     355        2       3.49
演愈        355     107        2       3.49
愈烈        107     118        2       4.58

Result: 宋 楚瑜 興票 案 愈演 愈烈
Example: 宋楚瑜興票案愈演愈烈 (1865718/4513)

bigram    f(C1)   f(C2)   f(C1C2)   I(C1,C2)
宋楚       2820    2065     1649       6.27
楚瑜       2065    1703     1678       6.79
瑜興       1703    1310        4       1.21
興票       1310    1945      383       5.64
票案       1945    2891       90       3.40
案愈       2891     345        2       1.32
愈演        345    1085        4       2.99
演愈       1085     345        4       2.99
愈烈        345     360        4       4.10

Result: 宋 楚瑜 興票 案 愈演 愈烈