
Final Project of Information Retrieval and Extraction

by d93921022 吳蕙如

Working Environment

• OS : Linux 7.3
• CPU : C800MHz
• Memory : 128 MB
• Tools used :

– stopper

– stemmer

– trec_eval

– sqlite

• Languages used :

– shell script : controls the inverted file indexing procedures

– AWK : used to extract the needed parts from the documents

– SQL : used when adopting the file-format database sqlite

First Indexing Trial

1. FBIS Source Files
2. Documents Separation : 18'51" + 55'13"
3. Documents Pass Stemmer : 33'52" + 1:00'58"
4. Documents Pass Stopper : 33'23" + 1:09'29"
5. Words Sort by AWK : 44'07" + 1:19'09"
6. Term Frequency Count and Inverted File Indexing (one file per word) : > 9 hours, never finished

(times are given as FBIS 3 + FBIS 4)

• When considering the indexing procedure, the most direct way is to do it step by step.

• So in the first trial, I performed each step and saved its result as the input of the next step.

• However, as the directory size grew, the time needed to write a file increased out of control.

• The time to generate the index files looked unacceptable, and the run was stopped after 9 hours. A sketch of this step-by-step approach is given below.
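The sketch below shows roughly what this per-step run might have looked like; the command names (stemmer, stopper), the awk extraction pattern, and the file layout are assumptions based on the tool list, not the exact script used.

    #!/bin/sh
    # Hypothetical step-by-step run for one FBIS source file ($1).
    # Each step writes a complete intermediate file for the next step to read.
    mkdir -p index
    awk '/<TEXT>/,/<\/TEXT>/' "$1" > "$1.docs"       # 2. documents separation (illustrative pattern)
    stemmer < "$1.docs" > "$1.stem"                  # 3. documents pass stemmer
    stopper < "$1.stem" > "$1.stop"                  # 4. documents pass stopper
    awk '{ for (i = 1; i <= NF; i++) print $i }' "$1.stop" | sort > "$1.sorted"   # 5. words sort
    # 6. term frequency count and inverted file indexing, one file per word
    #    (the step that never finished within 9 hours)
    uniq -c "$1.sorted" | while read tf term
    do
        echo "$1 $tf" >> "index/$term"               # a real run would record the document number
    done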

Second Indexing Trial

1. FBIS Source Files
2. Documents Separation : 23'29" + 58'36"
3. Documents Pass Stemmer : 30'05" + 1:07'26"
4. Documents Pass Stopper : 22'34" + 52'29"
5. Words Sort by AWK : 22'44" + 48'27"
6. Words Count and Indexing
   1. Two Suffix Directory Separating : 5"
   2. Word Files Indexing : 12:41'00" + break

• Generating the index files took too much time.

• This seemed to be caused by the number of files in a single directory.

• So I set up 26*26 sub-directories based on the first two characters of each word and spread the index files across them, as sketched below.

• However, it still took too long, and this trial was stopped after finishing FBIS3, after almost 13 hours.
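A rough sketch of the directory separation (the paths and names are illustrative, not the actual layout):

    #!/bin/sh
    # Create 26*26 sub-directories named by the first two letters of a word.
    letters="a b c d e f g h i j k l m n o p q r s t u v w x y z"
    for a in $letters; do
        for b in $letters; do
            mkdir -p "index/$a$b"
        done
    done
    # A term such as "report" then goes under index/re/report, so no single
    # directory has to hold all of the per-word index files.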

Third Indexing Trial

1. FBIS Source Files
2. Documents Separation : 20'15" + 1:09'38"
3. Documents Pass Stemmer : 29'25" + 55'42"
4. Documents Pass Stopper and Sort : 34'17" + 1:05'48"
5. Words Count and Indexing
   1. Suffix Directory Separating : 6"
   2. Word Files Indexing : (break after 11 hours)

• Before finding a way to solve the time-consuming indexing step, the earlier steps also cost a lot of time.

• I tried to combine the steps with a pipeline, but this only worked when using the system sort command.

• Using a stopper | sort pipeline saved at least one hour, as sketched below.

• The time cost was still far from acceptable.
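A sketch of the combined part of the pipeline, assuming stopper reads standard input (the exact invocation is not shown in the slides):

    #!/bin/sh
    # Third trial: only the stopper and sort steps are combined in one pipeline,
    # so the stopped output is never written to disk as a separate file.
    stopper < "$1.stem" |
        awk '{ for (i = 1; i <= NF; i++) print $i }' |
        sort > "$1.sorted"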

Fourth Indexing Trial

1. FBIS Source Files : 33'51" + 1:00'38"
   1. Documents Separation
   2. Documents Pass Stemmer
   3. Documents Pass Stopper and Sort
2. Words Count and Indexing
   1. Suffix Directory Separating : 2"
   2. Word Files Indexing : 13:14'23" + 14:15'12"

• I finally found out that the time was mostly spent searching for the location of the next write, which is a space-allocation characteristic of Linux file systems.

• So I combined the earlier steps into one run per source file, from the source file to the sorted word list, and removed every intermediate file as soon as the next step had used it, as sketched below.

• The time consumption decreased amazingly: it took only one third of the time used in the last trial.

• Indexing finished for the first time, after 29 hours.
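A rough sketch of the per-source-file run; the directory layout and file names are illustrative assumptions.

    #!/bin/sh
    # Fourth trial: one run per FBIS source file, removing each middle file
    # as soon as the next part has used it.
    for src in FBIS3/* FBIS4/*; do
        awk '/<TEXT>/,/<\/TEXT>/' "$src" > "$src.docs"
        stemmer < "$src.docs" > "$src.stem"
        rm -f "$src.docs"                           # middle file removed right away
        stopper < "$src.stem" |
            awk '{ for (i = 1; i <= NF; i++) print $i }' |
            sort > "$src.sorted"
        rm -f "$src.stem"
        # ... words count and word-file indexing read "$src.sorted" next ...
        rm -f "$src.sorted"
    done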

Fifth Indexing Trial

1. For Each FBIS Source File : 1:10'26" + 1:19'29"
   1. Documents Separation
   2. Documents Pass Stemmer
   3. Documents Pass Stopper and Sort
   4. Words Count and Database Indexing

• The indexing still took so long, and I really wanted a way to decrease the time cost.

• A file-format database might be a solution.

• So I adopted sqlite and wrote all my index lines as table rows into one database file, as sketched below.

• The total time cost immediately dropped to two and a half hours, which was amazing.
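The database indexing might look roughly like this. The table name inv, its columns (term, docno, tf), and the "tf docno term" input line format are assumptions, and the sqlite command-line tool is assumed to read SQL from standard input.

    #!/bin/sh
    # Fifth trial: index lines become rows of a sqlite table instead of
    # one file per word, so only a single database file is written.
    sqlite index.db 'create table inv (term, docno, tf);'
    # Wrap all inserts of one source file in a single transaction
    # (\047 is a single quote inside the awk program).
    awk 'BEGIN { print "begin;" }
         { printf "insert into inv values(\047%s\047, \047%s\047, %d);\n", $3, $2, $1 }
         END { print "commit;" }' "$1.tf" | sqlite index.db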

Indexing - Level Analysis

1. For Each FBIS Source File
   1. Documents Separation
   2. Documents Pass Stemmer
   3. Documents Pass Stopper and Sort
   4. Words Count and Database Indexing

   FBIS 3 and FBIS 4 indexed separately vs. as one combined set :
   time : 1:08'53" + 1:16'39" vs. 2:22'57"
   document count : 61578, 130417 vs. 130417 (same)
   file size : 262877184, 542937088 vs. same

• Since the whole indexing could now be done in 2.5 hours, I then tried to measure the influence of the collection level.

• I indexed FBIS3 and FBIS4 separately, then combined them as one set and indexed again.

• The time costs were nearly the same, and the document counts and file sizes were all equal.

• This is not at all surprising, because the working procedure does not add any outside information.

Sixth Indexing Trial

1. For Each FBIS Source File
   1. Documents Separation
   2. Documents Pass Stemmer
   3. Documents Pass Stopper and Sort
   4. Words Count and Write into Single Indexing File

   time, variant 1 (write per counted term) : 35'49" + 39'47"
   time, variant 2 (append per document) : 33'04" + 35'43"
   file size : 176340992, 365469696

• While revisiting the fourth and fifth trials, I figured out that the problem might be the number of index files.

• So I tried to write all the indexing records into a single file, as sketched below.

• Two sub-variants were tried :

– write after counting the term frequency of each word;

– append after computing all the frequencies of a document.
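The two variants could be sketched as below; the index line format (term, document id, frequency) and the use of $1.sorted from the earlier steps are assumptions.

    #!/bin/sh
    # Sixth trial: every index line goes into one file, index.all.
    # Variant 1: open and append to the single file once per counted term.
    uniq -c "$1.sorted" | while read tf term; do
        echo "$term $1 $tf" >> index.all
    done
    # Variant 2: compute all term frequencies of the document first,
    # then append them to the single file in one write.
    uniq -c "$1.sorted" | awk -v doc="$1" '{ print $2, doc, $1 }' >> index.all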

Seventh Indexing Trial

1. For Each FBIS Source File
   1. Documents Separation
   2. Documents Pass Stemmer
   3. Documents Pass Stopper and Sort
   4. Words Count and Write into 26*26 Indexing Files

   time : 44'38" + 50'32"
   file number : 646, 655
   total file size : 178606080, 367759360

• When considering both query and indexing, a single index file is just too large and would take a long time to search for the wanted terms.

• So I modified the final step to write the index lines into different files based on the first two characters of each word, as sketched below.
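A sketch of routing each index line into one of the 26*26 index files; the file naming and line format are illustrative.

    #!/bin/sh
    # Seventh trial: split the index lines into 26*26 files named after
    # the first two characters of each term.
    uniq -c "$1.sorted" | awk -v doc="$1" '{
        out = "index." substr($2, 1, 2)        # e.g. index.re for "report"
        print $2, doc, $1 >> out
        close(out)                             # keep the number of open files small
    }'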

Indexing Time

indexing   FBIS 3                                                  FBIS 4                                               total
trial 1    18'51"+33'52"+33'23"+44'07"+? >> 2:10'13"               55'13"+1:00'58"+1:09'29"+1:19'09"+? >> 4:24'49"      >> 6:35'02"
trial 2    23'29"+30'05"+22'34"+22'44"+5"+12:41'00" = 14:19'57"    58'36"+1:07'26"+52'29"+48'27"+? >> 3:46'58"          >> 18:06'55"
trial 3    20'15"+29'25"+34'17"+6"+? >> 1:24'03"                   1:09'38"+55'42"+1:05'48"+? >> 3:11'08"               >> 4:35'11"
trial 4    33'51"+13:14'23" = 13:48'14"                            1:00'38"+14:15'12" = 15:15'50"                       29:04'04"
trial 5    1:10'26"                                                1:19'29"                                             2:29'55"
trial 6-1  35'49"                                                  39'47"                                               1:15'36"
trial 6-2  33'04"                                                  35'43"                                               1:08'47"
trial 7    44'38"                                                  50'32"                                               1:35'10"

First Topic Query

1. Extract Topics from Source Files and Pass Stemmer and Stopper : 1"

2. Select Per Keyword Data from Index Database or Index file

3. Weight Computing

4. Ranking and Filtering

5. Evaluation

• Five query topics, 15 keywords in total

• Total time to query (FBIS 3, FBIS 3+4) :

– Index database : 13'38" 31'27"

– Single index file : 9'00" 18'39"

– Separated index files : 2'04"

• This does not seem efficient enough. If several terms were examined together, more time should be saved. A lookup sketch follows below.
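The per-keyword lookups might look like the sketch below; the table and column names follow the indexing sketches above, and the stemmed keyword 'retriev' is only an example.

    #!/bin/sh
    # First query scheme: one lookup per keyword.
    # From the index database (assumed table inv(term, docno, tf)):
    sqlite index.db "select docno, tf from inv where term = 'retriev';"
    # From the separated index files, the first two characters pick the file:
    grep '^retriev ' index.re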

Second Topic Query

1. Extract Topics from Source Files and Pass Stemmer and Stopper

2. Generate One Query String for Each Topic

3. Select Data from Index Database or Index File

4. Weight Computing

5. Ranking and Filtering

6. Evaluation

• Total time to query (FBIS 3, FBIS 3+4) :

– Index database : 2'30" 5'19"

– Single index file : 2'26" 4'55"

– Separated index files : not much improvement expected, since the queried files need to be checked separately.

• But as the number of query terms increases, using separated index files would save a lot more search time. A per-topic query sketch follows below.
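One query string per topic could be built with an IN list, as in this sketch; the stemmed keywords are illustrative.

    #!/bin/sh
    # Second query scheme: fetch the postings of all keywords of a topic
    # in one statement instead of one select per keyword.
    TERMS="'retriev', 'extract', 'index'"
    sqlite index.db "select term, docno, tf from inv where term in ($TERMS);"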

Updated Topic Query

1. Extract Topics from Source Files and Pass Stemmer and Stopper

2. Generate Query Strings based on frequency of each term

3. Select Data from Index Database or Index File

4. Weight Computing

5. Ranking and Filtering

6. Evaluation

• Some of the terms in the topics returned far too many documents and seemed not to work at all.

• So I checked the document frequency of each term and removed the high-frequency (>10%) terms, as sketched below.

• This did not work; more related terms would be needed for better precision.
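The 10% document-frequency filter could be sketched as below; the document count comes from the level analysis slide, and the table layout is the same assumption as before.

    #!/bin/sh
    # Drop topic terms whose document frequency exceeds 10% of the collection.
    NDOC=130417                                    # FBIS 3+4 document count
    for term in $(cat topic.terms); do
        df=$(sqlite index.db "select count(*) from inv where term = '$term';")
        if [ "$df" -le $((NDOC / 10)) ]; then      # keep only the lower-frequency terms
            echo "$term"
        fi
    done > topic.terms.filtered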

Frequency Term Query

1. Select Some Terms based on Descriptions, Narratives and web queries for each topic

2. Order these terms based on document frequency of each word

3. Decide the Number of Terms to Use and Generate Query Strings

4. The Following Steps Are the Same as Before

• The number of terms was tried from five to 100.

• The precision increases only at the beginning of adding terms.

• Meanwhile, the query time rises proportionally as the number of query terms increases.

• Terms of high frequency were removed; the thresholds tried were 10% and 20%.

• The stricter frequency limit (10%) seems to help. A term-selection sketch follows below.
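Ordering the candidate terms by document frequency and keeping the first N could be sketched as below; the slides do not state whether the order is ascending or descending, so ascending (rarer terms first) is assumed, and the file names are illustrative.

    #!/bin/sh
    # Order candidate terms (from descriptions, narratives and web queries)
    # by document frequency and keep only the first N of them.
    N=20
    for term in $(cat candidate.terms); do
        df=$(sqlite index.db "select count(*) from inv where term = '$term';")
        echo "$df $term"
    done | sort -n | head -n "$N" | awk '{ print $2 }' > query.terms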

Query : Topic

topic                       301      302      304      306      307      all
FBIS 3     num_rel          198      18       15       48       42       321
FBIS 3+4   num_rel          339      33       36       162      95       665
FBIS 3     map              0.0646   1.7786   0.3391   0.069    0.8065   0.6116
FBIS 3+4   map              0.0282   1.7585   0.3506   0.019    0.6185   0.555
FBIS 3     ircl_prn.0.30    0.064    1        0.2154   0.0357   0.5714   0.3773
FBIS 3+4   ircl_prn.0.30    0.0334   1        0.2273   0.0346   0.5333   0.3657
FBIS 3     P10              0.3      1        0        0.2      0.6      0.42
FBIS 3+4   P10              0.1      1        0        0        0.6      0.34

Query : Updated Topic

topic                       301      302      304      306      307      all
FBIS 3+4   num_rel          339      33       36       162      95       665
topic      map              0.0282   1.7585   0.3506   0.019    0.6185   0.555
updated    map              0.0218   0.8821   0.2238   0.0132   0.2676   0.2817
topic      ircl_prn.0.30    0.0334   1        0.2273   0.0346   0.5333   0.3657
updated    ircl_prn.0.30    0        1        0.2933   0        0.4366   0.346
topic      P10              0.1      1        0        0        0.6      0.34
updated    P10              0.1      1        0.1      0.2      0.6      0.4

Query : Terms

number of terms            5       10      15      20      30      40      60      80      100     topic
th=10%   num_rel_ret       467     480     491     491     479     467     412     427     412     398
th=20%   num_rel_ret       335     459     450     457     460     444     431     412     394
th=10%   map               0.332   0.313   0.293   0.309   0.294   0.273   0.212   0.231   0.212   0.555
th=20%   map               0.254   0.309   0.285   0.267   0.266   0.241   0.229   0.203   0.184
th=10%   ircl_prn.0.30     0.474   0.479   0.406   0.431   0.400   0.377   0.337   0.364   0.337   0.353
th=20%   ircl_prn.0.30     0.353   0.473   0.438   0.369   0.358   0.322   0.326   0.291   0.264
th=10%   P10               0.64    0.52    0.42    0.54    0.42    0.34    0.22    0.22    0.22    0.34
th=20%   P10               0.44    0.5     0.44    0.38    0.42    0.38    0.2     0.24    0.18

Query Time

[Chart : query time versus number of query terms for the four series db FBIS 3, db FBIS 3+4, file FBIS 3 and file FBIS 3+4; the data are listed below.]

query term       topic   5     10    15    20    30    40    60    80    100

db FBIS 3        30      44    71    100   126   180   234   347   462   582

db FBIS 3+4      63      94    147   202   258   372   484   721   953   1197

file FBIS 3      43      67    90    117   143   192   245   349   476   594

file FBIS 3+4    89      135   188   243   343   404   510   722   986   1232

Conclusion

• As I examined the index file and the term frequencies I generated, I found that many terms seem to be useless.

• They may be meaningless, like “aaaf”, or misspelled, like “internacion”.

• Some terms have a frequency count of less than three.

• If these terms were removed, the query would run even faster, I suppose.

• I could have spent more time sorting and indexing the inverted file.

• However, when I tried part of this, the time it consumed made me wonder whether it was worthwhile.

• Maybe a cache of recent queries is better than a full sorting process.

• This is the end of my project report.