1
A Text Filtering Method for Digital Libraries
Mustafa Zafer Bolat, Hayri Sever
Post on 21-Dec-2015
2
introduction
• Information filtering (IF)
– Incoming relevant documents are routed to profiles (queries).
• Information retrieval (IR)
– Provides a list of documents ordered by their similarity to the user query.
3
introduction (continued...)
• Linear Separation
– Partitions relevant and non-relevant documents into distinct blocks.
• Optimal Queries
– All relevant documents are ranked ahead of non-relevant ones.
• Steepest Descent Algorithm (SDA)
4
preliminaries
• An information retrieval system (S) can be defined as a 5-tuple:
• S = (T, D, Q, V, f)
– T: set of ordered index terms
– D: set of documents
– Q: set of queries
– V: set of real numbers
– f: D × Q → V, the retrieval function
5
preliminaries (continued)
• Vector Space Model
– Transformation of raw text into more computationally useful forms
– Documents and queries are represented as vectors of weighted terms
• d = (t1, wd1; t2, wd2; ...; tn, wdn), ti ∈ T_d
• q = (q1, wq1; q2, wq2; ...; qm, wqm), qi ∈ T_q
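A minimal sketch of this representation, assuming sparse term-to-weight dictionaries and assuming the retrieval function f is an inner product (the slides do not name f explicitly). The names `Vector` and `inner_product` and the query weights are hypothetical.

```python
from typing import Dict

Vector = Dict[str, float]  # maps an index term to its weight


def inner_product(d: Vector, q: Vector) -> float:
    """Assumed retrieval function f(d, q): sum of w_d * w_q over shared terms."""
    if len(q) < len(d):
        d, q = q, d  # iterate over the shorter vector
    return sum(w * q[t] for t, w in d.items() if t in q)


# Illustrative weights only; the query is made up for this example.
doc: Vector = {"meat": 0.278, "group": 0.127, "standard": 0.012}
query: Vector = {"meat": 1.0, "trade": 0.5}
score = inner_product(doc, query)
```

Only terms present in both vectors contribute, so the cost depends on the shorter vector rather than on the size of T.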
6
preliminaries (continued)
• Rnorm value for effectiveness
– Measures how relevant documents are distributed among non-relevant ones in the ranked output; rank matters.
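The slide does not state the Rnorm formula. The sketch below assumes one commonly used pairwise form, Rnorm = 1/2 (1 + (S+ − S−) / S+max), where S+ counts relevant/non-relevant pairs ranked in the right order, S− counts pairs ranked in the wrong order, and S+max is the total number of such pairs. The function name `rnorm` is hypothetical.

```python
from typing import Sequence


def rnorm(scores: Sequence[float], relevant: Sequence[bool]) -> float:
    """Pairwise Rnorm under the assumed definition described above."""
    agree = disagree = pairs = 0
    for s_rel, is_rel in zip(scores, relevant):
        if not is_rel:
            continue
        for s_non, other_rel in zip(scores, relevant):
            if other_rel:
                continue
            pairs += 1
            if s_rel > s_non:
                agree += 1      # relevant document ranked ahead
            elif s_rel < s_non:
                disagree += 1   # non-relevant document ranked ahead
    if pairs == 0:
        return 1.0  # nothing to order
    return 0.5 * (1 + (agree - disagree) / pairs)


perfect = rnorm([0.9, 0.8, 0.1], [True, True, False])
mixed = rnorm([0.9, 0.5, 0.7], [True, True, False])
```

Under this definition a value of 1 means every relevant document is ranked ahead of every non-relevant one.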
7
preliminaries (continued)
Contingency Table
                          actual relevant   actual non-relevant
predicted relevant               a                  b
predicted non-relevant           c                  d
• Precision = a / (a+b)
• Recall = a / (a+c)
• Breakeven point: where precision and recall are equal
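As a small worked example of the two formulas, the sketch below computes precision and recall from the contingency counts a, b and c. The counts used are made up for illustration.

```python
def precision(a: int, b: int) -> float:
    """a / (a + b): share of predicted-relevant documents that are relevant."""
    return a / (a + b) if (a + b) else 0.0


def recall(a: int, c: int) -> float:
    """a / (a + c): share of actually relevant documents that were predicted relevant."""
    return a / (a + c) if (a + c) else 0.0


# Hypothetical counts: 8 correct hits, 2 false alarms, 8 misses.
p = precision(8, 2)
r = recall(8, 8)
```

With these counts precision is 0.8 and recall is 0.5, so this cutoff is not a breakeven point.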
8
overview of experiment
[Figure: experiment pipeline. Reuters-21578 data set, split into train and test; preprocessing (parsing, removing stop words, stemming, transform to vectors, reducing, normalizing); training with SDA; optimal query; category labels; effectiveness measures.]
9
overview of experiment
Reuters-21578
• Consists of 21,578 economic news stories that originally appeared on the Reuters newswire in 1987.
• Each story has been manually assigned one or more indexing labels from a fixed list.
• There are 135 TOPIC labels for classification.
• To use the text corpus for machine learning research, it is split into sets of training and testing examples.
10
overview of experiment
<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET"
OLDID="9944" NEWID="5031"><DATE>13-MAR-1987 15:45:35.38</DATE>
<TOPICS><D>livestock</D><D>carcass</D></TOPICS><PLACES><D>usa</D></PLACES>
<PEOPLE></PEOPLE><ORGS><D>ec</D></ORGS>
<EXCHANGES></EXCHANGES><COMPANIES></COMPANIES>
<TEXT><TITLE>U.S. MEAT GROUP TO FILE TRADE COMPLAINTS</TITLE>
<DATELINE> WASHINGTON, March 13 - </DATELINE><BODY>The American Meat Institute, AME, said it intended to ask the U.S.
government to retaliate against a European Community meat inspection requirement. AME President C. Manly Molpus also said the industry would file a petition challenging Korea's ban of U.S. meat products.
Molpus told a Senate Agriculture subcommittee that AME and other livestock and farm groups intended to file a petition
under Section 301 of the General Agreement on Tariffs and Trade against an EC directive that, effective April 30, will require
U.S. meat processing plants to comply fully with EC standards.
Reuter</BODY></TEXT>
</REUTERS>
Sample Reuters 21578 Document
11
overview of experiment
Parsing
HAS TOPICS=YES
LEWISSPLIT=TRAIN
TOPICS: livestock, carcass
Body: U.S. MEAT GROUP TO FILE TRADE COMPLAINTS
The American Meat Institute, AME, said it intended to ask the U.S. government to retaliate against a European Community meat inspection requirement. AME President C. Manly Molpus also said the industry would file a petition challenging Korea's ban of U.S. meat products. Molpus told a Senate Agriculture subcommittee that AME and other livestock and farm groups intended to file a petition under Section 301 of the General Agreement on Tariffs and Trade against an EC directive that, effective April 30, will require U.S. meat processing plants to comply fully with EC standards
12
overview of experiment
After Parsing
HAS TOPICS=YES
LEWISSPLIT=TRAIN
TOPICS: livestock, carcass
Body: U S MEAT GROUP TO FILE TRADE COMPLAINTS
The American Meat Institute AME said it intended to ask the U S government to retaliate against a European Community meat inspection requirement AME President C Manly Molpus also said the industry would file a petition challenging Korea's ban of U S meat products Molpus told a Senate Agriculture subcommittee that AME and other livestock and farm groups intended to file a petition under Section of the General Agreement on Tariffs and Trade against an EC directive that effective April will require U S meat processing plants to comply fully with EC standards
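The parsing step can be sketched roughly as follows, assuming it extracts the TOPICS and LEWISSPLIT attributes, the topic list, and the title plus body, and then replaces punctuation and digits with spaces (apostrophes are kept, as in "Korea's"). The function `parse_story` and the abridged `SAMPLE` string are hypothetical; the regular expressions are a simplification, not a full SGML parser.

```python
import re

# Abridged version of the sample story, for illustration only.
SAMPLE = (
    '<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" '
    'OLDID="9944" NEWID="5031">'
    "<TOPICS><D>livestock</D><D>carcass</D></TOPICS>"
    "<TEXT><TITLE>U.S. MEAT GROUP TO FILE TRADE COMPLAINTS</TITLE>"
    "<BODY>The American Meat Institute, AME, said it intended to ask "
    "the U.S. government to retaliate.</BODY></TEXT></REUTERS>"
)


def parse_story(sgml: str) -> dict:
    """Extract attributes, topics and cleaned text from one <REUTERS> element."""
    open_tag = sgml.split(">", 1)[0]
    attrs = dict(re.findall(r'(\w+)="([^"]*)"', open_tag))
    topics_block = re.search(r"<TOPICS>(.*?)</TOPICS>", sgml, re.S)
    topics = re.findall(r"<D>(.*?)</D>", topics_block.group(1)) if topics_block else []
    title = re.search(r"<TITLE>(.*?)</TITLE>", sgml, re.S)
    body = re.search(r"<BODY>(.*?)</BODY>", sgml, re.S)
    text = " ".join(m.group(1) for m in (title, body) if m)
    # Keep letters and apostrophes; everything else becomes a space.
    text = re.sub(r"[^A-Za-z']+", " ", text).strip()
    return {
        "has_topics": attrs.get("TOPICS"),
        "split": attrs.get("LEWISSPLIT"),
        "topics": topics,
        "text": text,
    }


parsed = parse_story(SAMPLE)
```

Real stories may also contain character entities and a trailing "Reuter" sign-off, which this sketch does not handle.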
13
overview of experiment
Removing Stop Words
HAS TOPICS=YES
LEWISSPLIT=TRAIN
TOPICS: livestock, carcass
Body: U.S. MEAT GROUP FILE TRADE COMPLAINTS
The American Meat Institute, AME, said it intended to ask the U.S. government to retaliate against a European Community meat inspection requirement. AME President C. Manly Molpus also said the industry would file a petition challenging Korea's ban of U.S. meat products. Molpus told a Senate Agriculture subcommittee that AME and other livestock and farm groups intended to file a petition under Section 301 of the General Agreement on Tariffs and Trade against an EC directive that, effective April 30, will require U.S. meat processing plants to comply fully with EC standards
14
overview of experiment
After Removing Stop Words
HAS TOPICS=YES
LEWISSPLIT=TRAIN
TOPICS: livestock, carcass
Body: MEAT GROUP FILE TRADE COMPLAINTS
American Meat Institute AME intended ask government retaliate European Community meat inspection requirement. AME President Manly Molpus industry file petition challenging Korea's ban U.S. meat products Molpus Senate Agriculture subcommittee AME livestock farm groups intended file petition Section General Agreement Tariffs Trade EC directive effective April require meat processing plants comply fully EC standards
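A minimal sketch of stop-word removal, assuming a simple case-insensitive lookup in a fixed list. The `STOP_WORDS` set below is a small hypothetical list chosen to reproduce the first sentence of the example; the slides do not say which stop list was actually used.

```python
from typing import List

# Hypothetical stop list, for illustration only.
STOP_WORDS = {
    "the", "to", "it", "a", "an", "of", "and", "that", "said", "against",
    "also", "would", "will", "with", "under", "u", "s",
}


def remove_stop_words(text: str) -> List[str]:
    """Split on whitespace and drop tokens found in STOP_WORDS (case-insensitive)."""
    return [w for w in text.split() if w.lower() not in STOP_WORDS]


tokens = remove_stop_words(
    "The American Meat Institute AME said it intended to ask the U S "
    "government to retaliate against a European Community meat "
    "inspection requirement"
)
```

The surviving tokens keep their original order and case, so later steps can still decide how to fold case.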
15
overview of experiment
Stemming
HAS TOPICS=YES
LEWISSPLIT=TRAIN
TOPICS: livestock, carcass
Body: MEAT GROUP FILE TRADE COMPLAINT
American Meat Institute AME intended ask government retaliate European Community meat inspection requirement. AME President Manly Molpus industry file petition challeng Korea ban meat product Molpus Senate Agriculture subcommittee AME livestock farm group intended file petition Section General Agreement Tariff Trade EC direct effect April require meat process plant compli fulli EC standard
16
overview of experiment
Transform To Vectors
HAS TOPICS=YES
LEWISSPLIT=TRAIN
TOPICS: livestock, carcass
meat 5
group 1
...
Molpus 1
...
standard 1
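The transformation to a term-frequency vector can be sketched with a counter over the stemmed tokens. Folding case before counting is an assumption (it is what makes "MEAT", "Meat" and "meat" count as one term); `to_vector` is a hypothetical name.

```python
from collections import Counter
from typing import Iterable


def to_vector(tokens: Iterable[str]) -> Counter:
    """Count how often each (lower-cased) term occurs in one document."""
    return Counter(t.lower() for t in tokens)


vec = to_vector(["MEAT", "group", "Meat", "meat", "standard"])
```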
17
overview of experiment
Create Dictionary (only in training)
approv 1236
chairman 1225
...
ptd 5
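One way to build such a dictionary is sketched below. It assumes the numbers are document frequencies over the training set and that rare terms are dropped below a threshold; neither is stated on the slide, and `build_dictionary` and `min_count` are hypothetical names.

```python
from collections import Counter
from typing import Dict, Iterable, Mapping


def build_dictionary(
    train_vectors: Iterable[Mapping[str, int]], min_count: int = 1
) -> Dict[str, int]:
    """Map each term to the number of training documents that contain it."""
    doc_freq: Counter = Counter()
    for vec in train_vectors:
        doc_freq.update(vec.keys())  # each document counts once per term
    return {t: n for t, n in doc_freq.items() if n >= min_count}


dictionary = build_dictionary(
    [{"meat": 5, "group": 1}, {"meat": 2}, {"ptd": 1}], min_count=2
)
```

Because the dictionary is built only from training documents, test documents never influence which terms are kept.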
18
overview of experiment
Reducing
HAS TOPICS=YES
LEWISSPLIT=TRAIN
TOPICS: livestock, carcass
...
group 1
meat 5
Molpus ...
...
standard 1
...
19
overview of experiment
After Reducing
HAS TOPICS=YES
LEWISSPLIT=TRAIN
TOPICS: livestock, carcass
...
group 1
meat 5
...
standard 1
...
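The slides show only that "Molpus" disappears from the vector after reducing. The sketch below assumes the step keeps just the terms that are present in the training dictionary; `reduce_vector` is a hypothetical name.

```python
from typing import Collection, Dict, Mapping


def reduce_vector(vec: Mapping[str, float], dictionary: Collection[str]) -> Dict[str, float]:
    """Drop every term that is not in the dictionary."""
    return {t: w for t, w in vec.items() if t in dictionary}


reduced = reduce_vector(
    {"group": 1, "meat": 5, "molpus": 1, "standard": 1},
    {"group", "meat", "standard"},
)
```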
20
overview of experiment
Normalizing
HAS TOPICS=YES
LEWISSPLIT=TRAIN
TOPICS: livestock, carcass
...
group 1
meat 5
...
standard 1
...
$w_k = t_k \times \log(N_D / n_k)$
– $t_k$: term frequency
– $N_D$: number of documents in the collection
– $n_k$: number of documents containing $t_k$
$w'_k = w_k / \sqrt{\sum_j w_j^2}$
– $w'_k$: normalized weight of term k
– $w_k$: unnormalized weight of term k
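The weighting and normalization can be sketched as follows: each term frequency is multiplied by log(N_D / n_k), and the resulting vector is divided by its Euclidean length. The natural logarithm is an assumption, since the slide does not give the base; `tfidf_normalize` and the counts in the example are hypothetical.

```python
import math
from typing import Dict, Mapping


def tfidf_normalize(
    tf: Mapping[str, int], doc_freq: Mapping[str, int], n_docs: int
) -> Dict[str, float]:
    """Apply w_k = t_k * log(N_D / n_k), then scale the vector to unit length."""
    raw = {t: f * math.log(n_docs / doc_freq[t]) for t, f in tf.items()}
    length = math.sqrt(sum(w * w for w in raw.values()))
    if length == 0:
        return raw  # all weights are zero; nothing to scale
    return {t: w / length for t, w in raw.items()}


# Hypothetical document frequencies and collection size.
weights = tfidf_normalize({"meat": 5, "group": 1}, {"meat": 10, "group": 50}, 100)
```

After normalization the squared weights sum to 1, so document length no longer affects inner-product scores.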
21
overview of experiment
After Normalizing
HAS TOPICS=YES
LEWISSPLIT=TRAIN
TOPICS: livestock, carcass
...
group 0.127
meat 0.278
...
standard 0.012
...
22
overview of experiment
Training
1. Choose a starting query vector $Q_0$; let $k = 0$.
2. Let $Q_k$ be the query vector at the start of the $(k+1)$th iteration; identify the following set of difference vectors: $\Gamma(Q_k) = \{ b = d - d' : d \succ d' \text{ and } f(Q_k, b) \le 0 \}$. If $\Gamma(Q_k) = \emptyset$, $Q_{opt} = Q_k$ is a solution and exit; otherwise,
3. Let $Q_{k+1} = Q_k + \sum_{b \in \Gamma(Q_k)} b$.
4. $k = k + 1$; go back to Step (2).
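The four training steps can be sketched as below. The sketch assumes that f is an inner product, that a document is preferred to another exactly when the first is relevant and the second is not, and that Q0 is the zero vector; none of these choices is spelled out on the slide. The names `dot`, `difference` and `sda` are hypothetical.

```python
from typing import Dict, List, Tuple

Vector = Dict[str, float]


def dot(q: Vector, b: Vector) -> float:
    """Assumed retrieval function f(q, b): inner product over b's terms."""
    return sum(w * q.get(t, 0.0) for t, w in b.items())


def difference(d: Vector, d_prime: Vector) -> Vector:
    """Difference vector b = d - d'."""
    return {t: d.get(t, 0.0) - d_prime.get(t, 0.0) for t in set(d) | set(d_prime)}


def sda(relevant: List[Vector], nonrelevant: List[Vector],
        max_iter: int = 150) -> Tuple[Vector, int]:
    """Return (query, iterations used); stops early when no pair is misordered."""
    q: Vector = {}  # Step 1: Q0 is the zero vector (assumption)
    for k in range(max_iter):
        # Step 2: collect difference vectors the current query fails to separate.
        gamma = [b for d in relevant for dp in nonrelevant
                 for b in [difference(d, dp)] if dot(q, b) <= 0]
        if not gamma:
            return q, k  # Q_k ranks every relevant document ahead
        # Step 3: add all offending difference vectors to the query.
        for b in gamma:
            for t, w in b.items():
                q[t] = q.get(t, 0.0) + w
    return q, max_iter


rel = [{"meat": 1.0, "trade": 0.5}]
non = [{"oil": 1.0, "trade": 0.5}]
q_opt, iterations = sda(rel, non)
```

The `max_iter` cap keeps the loop finite when the two sets cannot be linearly separated.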
23
overview of experiment
Training
• All examples of the category are used as positive examples.
• A random 60% of the examples from other topics are used as negative examples.
• If the maximum Rnorm value (1) is not reached within the maximum of 150 iterations, the optimal query is set to the query that produced the highest Rnorm value.
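Selecting the training examples for one category might look like the sketch below. It assumes each document is a dictionary with a "topics" list and that the 60% sample is drawn without replacement; `build_training_set`, `negative_fraction` and `seed` are hypothetical names.

```python
import random
from typing import Dict, List, Tuple


def build_training_set(
    docs: List[Dict], category: str, negative_fraction: float = 0.6, seed: int = 0
) -> Tuple[List[Dict], List[Dict]]:
    """All documents of the category as positives, a random share of the rest as negatives."""
    positives = [d for d in docs if category in d["topics"]]
    others = [d for d in docs if category not in d["topics"]]
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    negatives = rng.sample(others, round(negative_fraction * len(others)))
    return positives, negatives


# Made-up corpus: 2 "earn" stories and 10 stories from another topic.
corpus = [{"id": i, "topics": ["earn"]} for i in range(2)]
corpus += [{"id": 100 + i, "topics": ["grain"]} for i in range(10)]
pos, neg = build_training_set(corpus, "earn")
```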
24
overview of experiment
There are 135 categories
Topic       # of + (train)   # of + (test)
earn             2877            1087
acq              1650             719
money-fx          538             179
grain             433             149
crude             389             189
trade             369             118
interest          347             131
wheat             212              71
ship              197              89
corn              182              56
25
overview of experiment
Effectiveness measures
• Create contingency tables
• Find breakeven points
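A sketch of these two steps for one category: rank the test documents by score, treat the top k as predicted relevant, and build the contingency counts. When k equals the number of actually relevant documents, a+b equals a+c, so precision and recall coincide; the sketch uses that cutoff as the breakeven point. Whether the authors used this cutoff or interpolated between cutoffs is not stated, and the function names are hypothetical.

```python
from typing import Sequence, Tuple


def contingency(scores: Sequence[float], relevant: Sequence[bool], k: int) -> Tuple[int, int, int, int]:
    """Counts (a, b, c, d) when the k highest-scoring documents are predicted relevant."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    top = order[:k]
    a = sum(1 for i in top if relevant[i])      # predicted and actually relevant
    b = len(top) - a                            # predicted relevant, actually not
    c = sum(1 for r in relevant if r) - a       # relevant documents that were missed
    d = len(scores) - a - b - c                 # correctly left out
    return a, b, c, d


def breakeven(scores: Sequence[float], relevant: Sequence[bool]) -> float:
    """Precision (equal to recall) at the cutoff k = number of relevant documents."""
    k = sum(1 for r in relevant if r)
    if k == 0:
        return 0.0
    a, b, _, _ = contingency(scores, relevant, k)
    return a / (a + b)


table = contingency([0.9, 0.8, 0.7, 0.1], [True, False, True, False], 2)
point = breakeven([0.9, 0.8, 0.7, 0.1], [True, False, True, False])
```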
26
Results
Breakeven points
Topic         Findsim   Nbayes   SDA     Bnets   Trees   SVM
earn           92.9      95.9    96.32   95.8    97.8    98.0
acq            64.7      87.8    85.26   88.3    89.7    93.6
money-fx       46.7      56.6    68.72   58.8    66.2    74.5
grain          67.5      78.8    71.81   81.4    85.0    94.6
crude          70.1      79.5    82.54   79.6    85.0    88.9
trade          65.1      63.5    65.25   69.0    72.5    75.9
interest       63.4      64.9    61.07   71.3    67.1    77.7
wheat          68.9      69.7    76.06   82.7    92.5    91.9
ship           49.2      85.4    65.17   84.4    74.2    85.6
corn           48.2      65.3    75.00   76.4    91.8    90.3
Avg. Top 10    64.6      81.5    84.54   85.0    88.4    92.0
Avg. All       61.7      75.2    76.37   80.0    N/A     87.0