Surveys of Some Critical Issues in Chinese Indexing
Chinese Document Indexing and Word Segmentation
• Speaker: Reuy-Lung Hsiao
• Date: Wed, Dec. 22
Roadmap
1. An overview of Web Information Retrieval system architecture
2. Automatic indexing overview
3. Questions of Chinese document indexing
4. Typical approaches to index Chinese document sets
5. Chinese word segmentation mechanisms
6. Segmentation algorithms
7. Discussion and Conclusion
8. References
System Overview
[Diagram: Web IR system architecture — Information Discovery feeds Indexing, which builds the Index Database; Query Formulation produces a Request, Similarity Measurement (Ranking) matches it against the index, and the Response is the Result Document Set. Chinese document indexing is the focus of this survey.]
Automatic Indexing Overview
1. An automatic indexing mechanism extracts the features (terms or keywords) of a given document.
2. The indexing process may contain the following steps:
   (1) Morphological & lexical analysis: stemming -> stop list -> weighting -> thesaurus construction
   (2) Syntactic & semantic analysis: part-of-speech tagging -> information extraction -> concept extraction
3. Weighting plays an important role in retrieval effectiveness.
   (1) Typical term-weighting mechanism: TFxIDF.
   (2) Typical effectiveness measurements: recall, precision.
Automatic Indexing Overview
4. TFxIDF

   w_ij = tf_ij × log(N / df_j)

   where tf_ij is the frequency of term j in document i, df_j is the number of documents containing term j, and N is the total number of documents.

5. Recall/Precision

   Recall = # retrieved relevant documents / # relevant documents
   Precision = # retrieved relevant documents / # retrieved documents

   [Diagram: the relevance line and the retrieval line divide the collection into regions A (retrieved but not relevant), B (retrieved and relevant), C (relevant but not retrieved), and D (neither); Recall = B / (B+C), Precision = B / (A+B).]
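As a concrete illustration of the two formulas above, a minimal sketch in Python (the toy counts and document IDs are invented for illustration):

```python
import math

def tfidf(tf, df, n_docs):
    """w_ij = tf_ij * log(N / df_j): term frequency scaled by rarity."""
    return tf * math.log(n_docs / df)

def recall_precision(retrieved, relevant):
    """Recall = |retrieved ∩ relevant| / |relevant|;
    Precision = |retrieved ∩ relevant| / |retrieved|."""
    hit = len(retrieved & relevant)
    return hit / len(relevant), hit / len(retrieved)

# Toy example: a term occurs 3 times in a document and in 10 of 1000 docs.
w = tfidf(3, 10, 1000)

# Toy evaluation: 4 documents retrieved, 3 are relevant in the collection.
retrieved = {"d1", "d2", "d3", "d4"}
relevant = {"d2", "d3", "d5"}
r, p = recall_precision(retrieved, relevant)  # r = 2/3, p = 2/4
```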
Questions of Chinese Document Indexing
1. Words, rather than characters, should be the smallest indexing unit.
   • More specific to the concepts
   • Less index space required
2. A comprehensive lexicon is needed.
3. Chinese text has no delimiters to mark word boundaries.
   For example, English words have spaces and punctuation as separators, but 中文句子沒有明顯的分隔符號 ("Chinese sentences have no obvious separators").
Approaches to indexing Chinese Text
1. N-gram indexing
   • Typically uses N = 1, 2, 3
   • Produces large index files
2. Statistical indexing
   • Typically uses mutual information for word correlation
3. Word-based indexing
   • Rule-based approach
   • Statistical approach
   • Hybrid approach
Approaches to indexing Chinese Text (N-gram Indexing)
• N-gram indexing terms produced from the same text string:

  sentence:  C1C2C3C4C5C6
  unigrams:  C1, C2, C3, C4, C5, C6
  bigrams:   C1C2, C2C3, C3C4, C4C5, C5C6
  trigrams:  C1C2C3, C2C3C4, C3C4C5, C4C5C6
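The n-gram expansion shown above can be sketched as a one-line generator (a minimal illustration, not from the source):

```python
def ngrams(text, n):
    """All overlapping character n-grams of a string: the indexing terms."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(ngrams("中文句子", 2))  # ['中文', '文句', '句子']
```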
• N-gram index size for the TREC-5 Chinese collection:

  n-gram     # distinct n-grams   # of n-grams
  unigram             6,236         64,611,662
  bigram          1,393,488         54,362,319
  trigram         8,119,574         49,886,331
Approaches to indexing Chinese Text (Statistical Indexing)
• Mutual information I(x,y) between two events x and y is defined as

  I(x,y) = log2( P(x,y) / (P(x)P(y)) )

• If two events occur independently, P(x,y) will be close to P(x)P(y), so I(x,y) will be close to zero.
• If two events are strongly related, P(x,y) will be much larger than P(x)P(y), so I(x,y) will be large.
• Probabilities are derived by statistical counting: with f(·) the corpus frequencies and N the corpus size, P(C1) = f(C1)/N and P(C1,C2) = P(C1)P(C2|C1) = f(C1C2)/N, so

  I(C1,C2) = log2( N · f(C1C2) / (f(C1) f(C2)) )
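The frequency-based form of the formula can be sketched directly (the toy counts below are an assumption, chosen so the result is exact):

```python
import math

def mutual_information(f_c1, f_c2, f_c1c2, n):
    """I(C1,C2) = log2( N * f(C1C2) / (f(C1) * f(C2)) ),
    with probabilities estimated from corpus frequencies."""
    return math.log2(n * f_c1c2 / (f_c1 * f_c2))

# Toy corpus of N=16 characters: C1 occurs 4 times, C2 occurs 4 times,
# and the pair C1C2 occurs 2 times -- twice the independent expectation.
print(mutual_information(4, 4, 2, 16))  # 1.0
```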
Approaches to indexing Chinese Text (Statistical Indexing)
• Statistical Indexing Algorithm
1. Compute the mutual-information values of all adjacent bigrams.
2. Treat the bigram with the largest mutual-information value as a word and remove it from the text.
3. Repeat step 2 on each remaining short phrase until every phrase consists of one or two characters.

• The following statistics are based on text collected from the China Times on 12/19/99, 12/20/99, and 12/21/99.
• 621,079 characters in total; 3,827 distinct characters per day on average.
• Comparison among the above indexing methods (result).
Approaches to indexing Chinese Text (Statistical Indexing)
Sentence: 連戰新的競選宣言

bigram    f(C1)    f(C2)   f(C1C2)   I(C1,C2)
連戰        543      517        76       5.12
戰新        517     1498         0      -7.13
新的       1498    16187        80       0.72
的競      16187      223        34       1.77
競選        223     1028        61       5.11
選宣       1028      259         2       1.54
宣言        259      305         8       4.14

Step   action        phrases
1      remove 連戰    □□新的競選宣言
2      remove 競選    □□新的□□宣言
3      remove 宣言    □□新的□□□□
4      remove 新的
other example
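The three-step greedy procedure described earlier (extract the adjacent bigram with the highest mutual information, then recurse on the remaining phrases) can be sketched as follows. The frequency table reuses the 連戰新的競選宣言 example; treating zero pair counts as -inf (rather than the slide's smoothed value) is a simplification:

```python
import math

def mi(bigram, freq, n):
    """Mutual information of a character pair from corpus counts.
    Zero pair counts get -inf here; the slide smooths them instead."""
    f12 = freq.get(bigram, 0)
    if f12 == 0:
        return float("-inf")
    return math.log2(n * f12 / (freq[bigram[0]] * freq[bigram[1]]))

def segment(text, freq, n):
    """Greedy MI segmentation: repeatedly extract the adjacent bigram
    with the highest MI until every remaining phrase has <= 2 chars."""
    words, phrases = [], [text]
    while phrases:
        phrase = phrases.pop()
        if len(phrase) <= 2:
            if phrase:
                words.append(phrase)
            continue
        pairs = [phrase[i:i + 2] for i in range(len(phrase) - 1)]
        best = max(pairs, key=lambda b: mi(b, freq, n))
        i = phrase.index(best)
        words.append(best)
        phrases += [phrase[:i], phrase[i + 2:]]
    return words

# Counts from the slide's China Times example.
freq = {"連": 543, "戰": 517, "新": 1498, "的": 16187, "競": 223,
        "選": 1028, "宣": 259, "言": 305,
        "連戰": 76, "戰新": 0, "新的": 80, "的競": 34,
        "競選": 61, "選宣": 2, "宣言": 8}
print(segment("連戰新的競選宣言", freq, 621079))
```

The extracted words match the slide's steps (連戰, 競選, 宣言, 新的), though this sketch emits them in extraction order rather than sentence order.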
Approaches to indexing Chinese Text (Word-based Indexing)
1. Rule-based approach
   • Uses a dictionary (lexicon) to match words.
   • Concept: a correct segmentation result should consist of legitimate words. For example, 中國文學 can be segmented as:
     1. 中國 文學
     2. 中國 文 學
     3. 中 國文 學
     4. 中 國 文學
     5. 中 國 文 學
     We choose (1) as the result.
   • Out-of-vocabulary problem.
2. Statistical approach
   • Relies on statistical information such as word and character (co-)occurrence frequencies in the training data.
   • Concept: given a sentence, the best segmentation is the sequence of potential words Si for which the product of the word probabilities is the highest.
   • Supervised/unsupervised learning.
   • Requires large amounts of data for accuracy.
   • Sparse-data problem.
Approaches to indexing Chinese Text (Word-based Indexing)
∏i P(Si)
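A common concrete instance of the rule-based (dictionary-matching) approach is forward maximum matching; a minimal sketch, with a toy lexicon as an assumption:

```python
def max_match(text, lexicon, max_len=4):
    """Forward maximum matching: at each position, take the longest
    dictionary word starting there; fall back to a single character."""
    words, i = [], 0
    while i < len(text):
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon or j == i + 1:
                words.append(text[i:j])
                i = j
                break
    return words

lexicon = {"中國", "文學", "國文"}  # toy dictionary
print(max_match("中國文學", lexicon))  # ['中國', '文學']
```

Note that greedy left-to-right matching picks 中國 before it can consider 國文, which is exactly why segmentation (1) of 中國文學 wins here.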
Approaches to indexing Chinese Text (Segmentation Algo.)
• Hybrid Segmentation Algorithm by Jian-Yun Nie, Martin Brisebois, SIGIR '96
• Uses lexicon and statistical information to segment words, with morphological heuristic rules to augment lexicon coverage. (Note: supervised learning.)
• Terminology:
  • Background knowledge: words contained in the dictionary, with a default probability (p)
  • Foreground knowledge: statistical information
  • Heuristic rules: two kinds of rules are included:
    • Nominal pre-determiner structures such as 這一年、一百本、每一天
    • Affix structures such as 小朋友、大眾化
Approaches to indexing Chinese Text (Segmentation Algo.)
• Algorithm:
  • Combination of both kinds of knowledge: if statistical information is available, use it; otherwise the background knowledge is taken into account.
  • Each character in the input string is associated with all the candidate words starting at that character, together with their probabilities.
  • The candidate words are combined to cover the input string, and the word sequence with the highest probability is chosen as the result.
• Example: 大會決議和議程項目

  characters:      大       會       決        議       和       議       程       項       目
  P(character):  (0.016)  (0.029)  (0.00108) (0.0005) (0.945)  (0.0005) (0.0005) (0.0005) (0.0024)

  candidate words:  大會    決議    議和    和議    議程    項目
  P(word):         (1.0)   (0.956) (0.001) (0.001) (1.0)   (0.936)

  Result: 大會 | 決議 | 和 | 議程 | 項目
Approaches to indexing Chinese Text (Segmentation Algo.)
• Unsupervised Segmentation Algorithm by Xiaoqiang Luo, Salim Roukos, ACL '96
• A pure statistical learning model that uses no dictionary. It divides the training set into two parts, randomly segments part one, and segments part two using part one.
• Uses the previously constructed language model for iteration.
• Uses a Viterbi-like algorithm to build the LM.
• Concept:

  Let a sentence S = C1C2...Cn-1Cn, where Ci (1≦i≦n) is a Chinese character. To segment the sentence into words is to group these characters into words, i.e.

    S = C1C2...Cn-1Cn
      = (C1...Cx1)(Cx1+1...Cx2)...(Cxm-1+1...Cxm)
      = W1W2...Wm

  where xk is the index of the last character of the kth word Wk, i.e. Wk = Cxk-1+1...Cxk (k=1..m), with x0=0 and xm=n.

• A segmentation of the sentence S can be uniquely represented by the integer sequence x1,...,xm, so we denote all possible segmentations by

    G(S) = { (x1,...,xm) | 1 ≦ x1 < ... < xm = n, m ≦ n }

  and assign a score to a segmentation g(S) = (x1,...,xm) ∈ G(S) by
Approaches to indexing Chinese Text (Segmentation Algo.)
  L(g(S)) = log Pg(W1...Wm) = Σ_{i=1}^{m} log Pg(Wi | hi)

  where Wj = Cxj-1+1...Cxj (j=1..m) and hi is the history words W1...Wi-1. Here we adopt a trigram model, with hi = Wi-2Wi-1.

• Among all possible segmentations, we pick the one g* with the highest score as our result. That is,

  g* = arg max_{g∈G(S)} L(g(S)) = arg max_{g∈G(S)} log Pg(W1...Wm)

• Let L(k) be the maximum accumulated score over the first k characters. L(k) is defined for k=1..n, with L(1)=0 and L(g*) = L(n).
Approaches to indexing Chinese Text (Segmentation Algo.)
• Given { L(i) | 1≦i≦k-1 }, L(k) can be computed recursively as follows:

    L(k) = max_{1≦i≦k-1} [ L(i) + log P(Ci+1...Ck | hi) ]
    p(k) = arg max_{1≦i≦k-1} [ L(i) + log P(Ci+1...Ck | hi) ]

  so that Cp(k)+1...Ck comprises the last word of the optimal segmentation of the first k characters.
• For example, a six-character sentence:

    chars:  C1  C2  C3  C4  C5  C6
    k:       1   2   3   4   5   6
    p(k):    0   1   1   3   3   4

  Backtracking from p(6)=4 gives the words C5C6, C4, C2C3, C1, so the optimal segmentation of the sentence is (C1)(C2C3)(C4)(C5C6).
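The L(k)/p(k) recursion can be sketched as follows, with a unigram word model standing in for the trigram history hi, and an invented toy vocabulary (indices shifted so L[0] = 0 is the empty prefix):

```python
import math

def viterbi_segment(text, word_logprob):
    """Viterbi-like DP: L[k] is the best score over segmentations of the
    first k characters; p[k] marks where the last word begins."""
    n = len(text)
    L = [0.0] + [float("-inf")] * n   # L[0] = 0: empty prefix
    p = [0] * (n + 1)
    for k in range(1, n + 1):
        for i in range(k):
            w = text[i:k]
            if w in word_logprob and L[i] + word_logprob[w] > L[k]:
                L[k] = L[i] + word_logprob[w]
                p[k] = i
    # Backtrack through p to recover the word sequence.
    words, k = [], n
    while k > 0:
        words.append(text[p[k]:k])
        k = p[k]
    return words[::-1]

# Toy model over abstract characters a..f (cf. the (C1)(C2C3)(C4)(C5C6) example).
lp = {w: math.log(q) for w, q in
      {"a": 0.3, "bc": 0.2, "b": 0.05, "c": 0.05,
       "d": 0.3, "ef": 0.2, "e": 0.05, "f": 0.05}.items()}
print(viterbi_segment("abcdef", lp))  # ['a', 'bc', 'd', 'ef']
```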
Discussion and Conclusion
1. Since most Chinese words consist of two characters, bigram/statistical indexing outperforms the other methods, even the dictionary-based method. (According to New Advances in Computers and Natural Language Processing in China, Liu, Information Science (for China), '87: 5% of words are unigrams, 75% are bigrams, 14% are trigrams, and 6% are words of four or more characters.)
2. Character-based indexing is not suited for Chinese text retrieval, for the reasons below:
   • Character-based approaches lead to a great deal of incorrect matching between queries and documents, due to the quite free combination of characters.
Discussion and Conclusion
   • A complex concept would always have to be expressed by the same fixed character string in both the documents and the query.
   • In character-based approaches, every character is dealt with in the same way.
   • Character-based approaches do not allow us to easily incorporate linguistic knowledge into the searching process.
3. Word-based indexing is the first step toward concept-based indexing/retrieval, to avoid another information explosion.
Reference
1. A Statistical Method for Finding Word Boundaries in Chinese Text – Richard Sproat and Chilin Shih, CPOCOL '90
2. On Chinese Text Retrieval – Jian-Yun Nie, Martin Brisebois, SIGIR '96
3. An Iterative Algorithm to Build Chinese Language Models – Xiaoqiang Luo, Salim Roukos, ACL '96
4. Chinese Text Retrieval Without Using a Dictionary – Aitao Chen, Jianzhang He, SIGIR '97
5. A Tagging-Based First-Order Markov Model Approach to Automatic Word Identification for Chinese Sentences – T.B.Y. Lai, M.S. Sun, COLING '98
6. Chinese Indexing Using Mutual Information – Christopher C., Asian Digital Library Workshop '98
7. A New Statistical Formula for Chinese Text Segmentation Incorporating Contextual Information – Yubin Dai, Teck Ee Loh, SIGIR '99
8. Discovering Chinese Words from Unsegmented Text – Xianping Ge, Wanda Pratt, SIGIR '99
#   Index file   Indexing terms      Segmentation/indexing method    Dictionary, stop-list
1   Unigram      Unigrams            Unigram                         None
2   Bigram       Bigrams             Bigram                          Stop-list only
3   Trigram      Trigrams            Trigram                         Stop-list only
4   Mi           Bigrams, unigrams   Mutual information              Stop-list only
5   Max(f)       Words, phrases      Maximum matching (forward)      Both
6   Max(b)       Words, phrases      Maximum matching (backward)     Both
7   Min(f)       Words, phrases      Minimum matching (forward)      Both
8   Min(b)       Words, phrases      Minimum matching (backward)     Both
recall unigram bigram trigram mi Max(f) Max(b) Min(f) Min(b)
0.00 0.7751 0.7504 0.6962 0.7696 0.8000 0.7966 0.7404 0.7265
0.10 0.5609 0.6241 0.5006 0.6500 0.6465 0.6414 0.5543 0.5611
0.20 0.4076 0.5243 0.3600 0.5355 0.5283 0.5028 0.4336 0.4432
0.30 0.3400 0.4778 0.2932 0.4705 0.4308 0.4518 0.3595 0.3734
0.40 0.2904 0.4375 0.2546 0.4324 0.3841 0.4085 0.3049 0.3245
0.50 0.2486 0.3864 0.2153 0.3872 0.3455 0.3671 0.2569 0.2903
0.60 0.2050 0.3295 0.1815 0.3346 0.2947 0.3131 0.2216 0.2351
0.70 0.1576 0.2749 0.1586 0.2843 0.2439 0.2678 0.1657 0.1912
0.80 0.0982 0.2173 0.1142 0.2353 0.1891 0.2017 0.1221 0.1217
0.90 0.0300 0.1241 0.0581 0.1378 0.1051 0.1105 0.0819 0.0778
1.00 0.0031 0.0108 0.0091 0.0208 0.0282 0.0341 0.0197 0.0118
Average precision
       0.2609  0.3677  0.2405  0.3744  0.3558  0.3465  0.2738  0.2862
vs. Max(f) baseline
      -26.67%  +3.34% -32.40%  +5.23% baseline -2.61% -23.04% -19.56%
[Figure: recall-precision curves (recall 0–1.0 on the x-axis, precision 0–0.7 on the y-axis) comparing the Dictionary, Statistical, and Hybrid segmentation methods.]
• Corpus
  • 1,270 Kbytes
  • Training set: 1,247
  • Test set: 272
  • 90 words and 160 characters per document on average
  • Segmentation accuracy is around 91%
• Uses a stop list such as 的、並、除非、此外…
Example: 宋楚瑜興票案愈演愈烈 (849967/3998)

bigram    f(C1)   f(C2)   f(C1C2)   I(C1,C2)
宋楚       1103     800      665       6.15
楚瑜        800     673      665       6.64
瑜興        673     498        1       0.62
興票        498     687      191       5.85
票案        687    1061       66       4.03
案愈       1061     107        1       1.70
愈演        107     355        2       3.49
演愈        355     107        2       3.49
愈烈        107     118        2       4.58

Result: 宋 楚瑜 興票 案 愈演 愈烈
Example: 宋楚瑜興票案愈演愈烈 (1865718/4513)

bigram    f(C1)   f(C2)   f(C1C2)   I(C1,C2)
宋楚       2820    2065     1649       6.27
楚瑜       2065    1703     1678       6.79
瑜興       1703    1310        4       1.21
興票       1310    1945      383       5.64
票案       1945    2891       90       3.40
案愈       2891     345        2       1.32
愈演        345    1085        4       2.99
演愈       1085     345        4       2.99
愈烈        345     360        4       4.10

Result: 宋 楚瑜 興票 案 愈演 愈烈