

Baskent University Text Filtering 1

A Text Filtering Method For Digital Libraries

Mustafa Zafer BOLAT

Hayri SEVER

BASKENT UNIVERSITY COMPUTER ENGINEERING DEPARTMENT

Introduction

• Information filtering (IF)
– Incoming non-relevant documents are filtered out.
• Information retrieval (IR)
– Provides a list of documents ordered by their similarity to the user query.

Introduction (continued)

• Linear separation
– Partitions relevant and non-relevant documents into distinct blocks.
• Optimal queries
– All relevant documents are ranked ahead of non-relevant ones.
• Steepest Descent Algorithm (SDA)

Preliminaries

• An information retrieval system S can be defined as a 5-tuple:
• $S = (T, D, Q, V, f)$
– $T$: set of ordered index terms
– $D$: set of documents
– $Q$: set of queries
– $V$: set of real numbers
– $f: D \times Q \to V$: retrieval function

Preliminaries (continued)

• Vector Space Model
– Transforms raw text into more computationally useful forms.
– Documents and queries are represented as vectors of weighted terms.
• $d = (t_1, w_{d1}; t_2, w_{d2}; \ldots; t_n, w_{dn})$, $t_i \in T \cap d$
• $q = (q_1, w_{q1}; q_2, w_{q2}; \ldots; q_m, w_{qm})$, $q_i \in T \cap q$
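The weighted-term representation can be sketched in Python as sparse dictionaries. The slides do not state which retrieval function f is used, so the inner product below is an assumption, and the weights and document names are hypothetical.

```python
from typing import Dict

# A document or query as a sparse vector: term -> weight (missing terms weigh 0).
Vector = Dict[str, float]

def f(d: Vector, q: Vector) -> float:
    """Inner product of d and q; an assumed choice for the retrieval function f."""
    return sum(w * q.get(term, 0.0) for term, w in d.items())

# Hypothetical weights, for illustration only.
docs = {
    "d1": {"meat": 0.8, "trade": 0.6},
    "d2": {"wheat": 1.0},
}
query = {"meat": 1.0, "livestock": 0.5}

# Rank documents by decreasing similarity to the query.
ranking = sorted(docs, key=lambda name: f(docs[name], query), reverse=True)
```

Storing only non-zero weights keeps each vector small, since a news story uses a tiny fraction of the index terms in T.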


Preliminaries (continued)

• Rnorm value for effectiveness
– Measures how relevant documents are distributed among non-relevant ones in the ranking; rank matters.
• S+: number of document pairs where the preferred document is ranked higher
• S-: number of document pairs where the non-preferred document is ranked higher
• S+max: maximal possible value of S+

Example ranking: (rnrn | rnnnnn), where r is a relevant document, n is a non-relevant document, and | separates two rank levels.

S+ = 10, S- = 2, S+max = 21
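The counts in the example can be checked with a short Python sketch. The slide does not show the Rnorm formula itself; the code assumes the usual definition Rnorm = 1/2 (1 + (S+ - S-) / S+max), and the function names are hypothetical.

```python
from typing import List, Tuple

def rnorm_counts(levels: List[str]) -> Tuple[int, int, int]:
    """levels: rank levels from best to worst; each is a string of 'r'/'n' for tied documents."""
    s_plus = s_minus = 0
    rel_above = non_above = 0  # documents seen in strictly higher rank levels
    for level in levels:
        r, n = level.count("r"), level.count("n")
        s_plus += rel_above * n    # relevant ranked above non-relevant
        s_minus += non_above * r   # non-relevant ranked above relevant
        rel_above += r
        non_above += n
    s_max = rel_above * non_above  # every relevant/non-relevant pair in the right order
    return s_plus, s_minus, s_max

def rnorm(levels: List[str]) -> float:
    # Assumed definition: Rnorm = 1/2 * (1 + (S+ - S-) / S+max).
    s_plus, s_minus, s_max = rnorm_counts(levels)
    return 0.5 * (1 + (s_plus - s_minus) / s_max)

counts = rnorm_counts(["rnrn", "rnnnnn"])  # the ranking from the slide
value = rnorm(["rnrn", "rnnnnn"])
```

Pairs of documents tied within the same rank level contribute to neither S+ nor S-, which is why S+ + S- is smaller than S+max here.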

Preliminaries (continued)

Contingency Table

                         actual relevant   actual non-relevant
predicted relevant              a                   b
predicted non-relevant          c                   d

• Precision = a / (a + b)
• Recall = a / (a + c)
• Breakeven point: where precision and recall are equal
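As a small worked example, precision and recall can be computed directly from the table cells. The counts below are hypothetical and only illustrate the two formulas.

```python
def precision(a: int, b: int) -> float:
    """a / (a + b): share of documents predicted relevant that are actually relevant."""
    return a / (a + b) if (a + b) else 0.0

def recall(a: int, c: int) -> float:
    """a / (a + c): share of actually relevant documents that were predicted relevant."""
    return a / (a + c) if (a + c) else 0.0

# Hypothetical contingency table cells.
a, b, c, d = 40, 10, 20, 930

p = precision(a, b)  # 40 / 50
r = recall(a, c)     # 40 / 60
```

Cell d (non-relevant documents correctly rejected) enters neither measure, so a large collection of obvious non-matches does not inflate the scores.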

Overview of experiment

[Diagram: Reuters-21578 data set (train/test split, topics) → Preprocessing (parsing, removing stop words, stemming, transform to vectors, reducing, normalizing) → Training with SDA → Optimal query → Effectiveness measures]

Overview of experiment (Reuters-21578 data set)

Reuters-21578

• Consists of 21,578 economic news stories that originally appeared on the Reuters newswire in 1987.
• Each story has been manually assigned one or more indexing labels from a fixed list.
• There are 135 TOPICS labels for classification.
• To use a text corpus for machine learning research, it is split into sets of training and testing examples.

Overview of experiment (Reuters-21578 data set)

Sample Reuters-21578 Document

<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" OLDID="9944" NEWID="5031">
<DATE>13-MAR-1987 15:45:35.38</DATE>
<TOPICS><D>livestock</D><D>carcass</D></TOPICS>
<PLACES><D>usa</D></PLACES>
<PEOPLE></PEOPLE>
<ORGS><D>ec</D></ORGS>
<EXCHANGES></EXCHANGES>
<COMPANIES></COMPANIES>
<TEXT>&#2;
<TITLE>U.S. MEAT GROUP TO FILE TRADE COMPLAINTS</TITLE>
<DATELINE> WASHINGTON, March 13 - </DATELINE><BODY>The American Meat Institute, AME, said it intended to ask the U.S. government to retaliate against a European Community meat inspection requirement. AME President C. Manly Molpus also said the industry would file a petition challenging Korea's ban of U.S. meat products.
Molpus told a Senate Agriculture subcommittee that AME and other livestock and farm groups intended to file a petition under Section 301 of the General Agreement on Tariffs and Trade against an EC directive that, effective April 30, will require U.S. meat processing plants to comply fully with EC standards.
Reuter&#3;</BODY></TEXT>
</REUTERS>

Overview of experiment (preprocessing: parsing)

Parsing

HAS TOPICS=YES
LEWISSPLIT=TRAIN
TOPICS: livestock, carcass
Body: U.S. MEAT GROUP TO FILE TRADE COMPLAINTS The American Meat Institute, AME, said it intended to ask the U.S. government to retaliate against a European Community meat inspection requirement. AME President C. Manly Molpus also said the industry would file a petition challenging Korea's ban of U.S. meat products. Molpus told a Senate Agriculture subcommittee that AME and other livestock and farm groups intended to file a petition under Section 301 of the General Agreement on Tariffs and Trade against an EC directive that, effective April 30, will require U.S. meat processing plants to comply fully with EC standards
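Pulling these fields out of a Reuters-21578 story can be sketched with regular expressions. This is a minimal illustration, not the authors' parser: the function name `parse_story` and the returned keys are hypothetical, and the sample below is an abridged version of the story shown on the slides.

```python
import re

# Abridged sample story; the full text is longer.
SAMPLE = '''<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" OLDID="9944" NEWID="5031">
<TOPICS><D>livestock</D><D>carcass</D></TOPICS>
<TEXT><TITLE>U.S. MEAT GROUP TO FILE TRADE COMPLAINTS</TITLE>
<BODY>The American Meat Institute, AME, said it intended to ask the U.S. government to retaliate.</BODY></TEXT>
</REUTERS>'''

def parse_story(sgml: str) -> dict:
    """Extract the attributes, topic labels, and title plus body text of one story."""
    head = re.search(r"<REUTERS\s+([^>]*)>", sgml).group(1)
    attrs = dict(re.findall(r'(\w+)="([^"]*)"', head))
    topics_block = re.search(r"<TOPICS>(.*?)</TOPICS>", sgml, re.S)
    topics = re.findall(r"<D>(.*?)</D>", topics_block.group(1)) if topics_block else []
    title = re.search(r"<TITLE>(.*?)</TITLE>", sgml, re.S)
    body = re.search(r"<BODY>(.*?)</BODY>", sgml, re.S)
    # Title and body are concatenated, as in the "Body:" field on the slide.
    text = " ".join(m.group(1).strip() for m in (title, body) if m)
    return {
        "has_topics": attrs.get("TOPICS"),
        "split": attrs.get("LEWISSPLIT"),
        "topics": topics,
        "body": text,
    }

story = parse_story(SAMPLE)
```

Regular expressions are enough here because the stories are flat SGML with no nesting beyond the tags shown; a real run would also need to split each data file into individual stories first.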

Overview of experiment (preprocessing: parsing)

After Parsing

HAS TOPICS=YES
LEWISSPLIT=TRAIN
TOPICS: livestock, carcass
Body: U S MEAT GROUP TO FILE TRADE COMPLAINTS The American Meat Institute AME said it intended to ask the U S government to retaliate against a European Community meat inspection requirement AME President C Manly Molpus also said the industry would file a petition challenging Korea's ban of U S meat products Molpus told a Senate Agriculture subcommittee that AME and other livestock and farm groups intended to file a petition under Section of the General Agreement on Tariffs and Trade against an EC directive that effective April will require U S meat processing plants to comply fully with EC standards
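Comparing the text before and after parsing suggests that punctuation and digits are dropped while apostrophes survive ("Korea's"). The sketch below reproduces that behaviour under this assumption; the function name is hypothetical and the authors' exact rules are not given on the slides.

```python
import re

def strip_non_letters(text: str) -> str:
    """Replace every run of characters other than letters and apostrophes with one space."""
    return " ".join(re.sub(r"[^A-Za-z']+", " ", text).split())

sample = "Korea's ban of U.S. meat products, under Section 301, effective April 30."
cleaned = strip_non_letters(sample)
# cleaned == "Korea's ban of U S meat products under Section effective April"
```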

Overview of experiment (preprocessing: removing stop words)

Removing Stop Words

HAS TOPICS=YES
LEWISSPLIT=TRAIN
TOPICS: livestock, carcass
Body: U.S. MEAT GROUP FILE TRADE COMPLAINTS The American Meat Institute, AME, said it intended to ask the U.S. government to retaliate against a European Community meat inspection requirement. AME President C. Manly Molpus also said the industry would file a petition challenging Korea's ban of U.S. meat products. Molpus told a Senate Agriculture subcommittee that AME and other livestock and farm groups intended to file a petition under Section 301 of the General Agreement on Tariffs and Trade against an EC directive that, effective April 30, will require U.S. meat processing plants to comply fully with EC standards

Overview of experiment (preprocessing: removing stop words)

After Removing Stop Words

HAS TOPICS=YES
LEWISSPLIT=TRAIN
TOPICS: livestock, carcass
Body: . MEAT GROUP FILE TRADE COMPLAINTS American Meat Institute AME intended ask government retaliate European Community meat inspection requirement. AME President Manly Molpus industry file petition challenging Korea's ban U.S. meat products Molpus Senate Agriculture subcommittee AME livestock farm groups intended file petition Section General Agreement Tariffs Trade EC directive effective April require meat processing plants comply fully EC standards
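Stop-word removal amounts to filtering tokens against a fixed list. The list below is a small hypothetical one chosen to cover the sample sentence, not the list used in the experiment; dropping single-letter tokens (such as the "U" and "S" left over from "U.S.") is likewise an assumption based on the slide output.

```python
# Hypothetical stop list; a real one would contain a few hundred words.
STOP_WORDS = frozenset({
    "the", "to", "it", "a", "an", "of", "and", "that", "also", "said",
    "would", "will", "with", "on", "against", "under", "other", "told",
})

def remove_stop_words(tokens):
    """Keep tokens that are longer than one character and not in the stop list."""
    return [t for t in tokens if len(t) > 1 and t.lower() not in STOP_WORDS]

tokens = "The American Meat Institute AME said it intended to ask the U S government to retaliate".split()
kept = remove_stop_words(tokens)
```

For this sentence, `kept` is the same word sequence that opens the body text on the slide.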

Overview of experiment (preprocessing: stemming)

Stemming

HAS TOPICS=YES
LEWISSPLIT=TRAIN
TOPICS: livestock, carcass
Body: MEAT GROUP FILE TRADE COMPLAINT American Meat Institute AME intend ask government retaliate European Community meat inspection require. AME President Manly Molpus industry file petition challeng Korea ban meat product Molpus Senate Agriculture subcommittee AME livestock farm group intended file petition Section General Agreement Tariff Trade EC direct effect April require meat process plant compli fulli EC standard

Overview of experiment (preprocessing: transform to vectors)

Transform To Vectors

HAS TOPICS=YES
LEWISSPLIT=TRAIN
TOPICS: livestock, carcass

Term      Frequency
meat      5
group     1
...       ...
Molpus    1
...       ...
standard  1
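Turning a stemmed token list into a term-frequency vector can be sketched with `collections.Counter`. Folding case before counting is an assumption (the slide keeps "Molpus" capitalized), and the token list is a short hypothetical excerpt rather than the full story.

```python
from collections import Counter

def to_vector(stems):
    """Count how often each stem occurs; case is folded first (an assumption)."""
    return Counter(s.lower() for s in stems)

# Hypothetical excerpt of stemmed tokens.
stems = "MEAT GROUP FILE TRADE COMPLAINT American Meat Institute".split()
vector = to_vector(stems)
```

A `Counter` returns 0 for terms that never occur, which matches the sparse-vector view where absent terms have zero weight.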

Overview of experiment (preprocessing: create dictionary)

Create Dictionary (only in training)

Term      Count
approv    1236
chairman  1225
...       ...
ptd       5

Overview of experiment (preprocessing: reducing)

Reducing

HAS TOPICS=YES
LEWISSPLIT=TRAIN
TOPICS: livestock, carcass

Term      Frequency
...       ...
group     1
meat      5
Molpus    ...
...       ...
standard  1
...       ...

Overview of experiment (preprocessing: reducing)

After Reducing

HAS TOPICS=YES
LEWISSPLIT=TRAIN
TOPICS: livestock, carcass

Term      Frequency
...       ...
group     1
meat      5
...       ...
standard  1
...       ...
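One plausible reading of the dictionary and reducing steps is that terms are counted over the training set and each document vector then keeps only dictionary terms, which would explain why "Molpus" disappears. The sketch below follows that reading; the cutoff `min_count`, the function names, and the toy counts are all hypothetical.

```python
from collections import Counter

def build_dictionary(train_vectors, min_count):
    """Sum term counts over the training vectors and keep terms seen at least min_count times."""
    totals = Counter()
    for vec in train_vectors:
        totals.update(vec)
    return {term for term, count in totals.items() if count >= min_count}

def reduce_vector(vec, dictionary):
    """Drop every term that is not in the dictionary."""
    return {term: count for term, count in vec.items() if term in dictionary}

# Toy training vectors, for illustration only.
train_vectors = [
    Counter(meat=5, group=1, molpus=1),
    Counter(meat=2, standard=1),
    Counter(group=3, standard=2),
]
dictionary = build_dictionary(train_vectors, min_count=2)
reduced = reduce_vector(train_vectors[0], dictionary)
```

Building the dictionary from training data only keeps test documents from influencing which terms the learned query can use.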

Overview of experiment (preprocessing: normalizing)

Normalizing

HAS TOPICS=YES
LEWISSPLIT=TRAIN
TOPICS: livestock, carcass

Term      Frequency
...       ...
group     1
meat      5
...       ...
standard  1
...       ...

$w_k = t_k \times \log(N_D / n_k)$
– $t_k$: frequency of term $k$
– $N_D$: number of documents in the collection
– $n_k$: number of documents containing term $k$

$w'_k = w_k / \sqrt{\sum_j w_j^2}$
– $w'_k$: normalized weight of term $k$
– $w_k$: unnormalized weight of term $k$
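The weighting and normalization can be sketched in a few lines of Python. The slide does not give the base of the logarithm, so the natural log is assumed; the division by the Euclidean length of the weight vector follows the usual cosine normalization, and the document frequencies and collection size below are hypothetical.

```python
import math

def tfidf(tf, doc_freq, n_docs):
    """w_k = t_k * log(N_D / n_k) for every term k in the document (natural log assumed)."""
    return {t: f * math.log(n_docs / doc_freq[t]) for t, f in tf.items()}

def normalize(weights):
    """Divide each weight by the Euclidean length of the weight vector."""
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else dict(weights)

tf = {"meat": 5, "group": 1, "standard": 1}
doc_freq = {"meat": 100, "group": 500, "standard": 900}  # hypothetical n_k values
weights = normalize(tfidf(tf, doc_freq, n_docs=1000))    # hypothetical N_D
```

After normalization every document vector has unit length, so long stories do not score higher merely because they contain more words.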

Overview of experiment (preprocessing: normalizing)

After Normalizing

HAS TOPICS=YES
LEWISSPLIT=TRAIN
TOPICS: livestock, carcass

Term      Weight
...       ...
group     0.127
meat      0.278
...       ...
standard  0.012
...       ...

Overview of experiment (training with SDA)

Training

1. Choose a starting query vector $Q_0$; let $k = 0$.
2. Let $Q_k$ be the query vector at the start of the $(k+1)$th iteration; identify the following set of difference vectors:
   $\Gamma(Q_k) = \{\, b = d - d' : d \succ d' \text{ and } f(Q_k, b) \le 0 \,\}$;
   if $\Gamma(Q_k) = \emptyset$, $Q_{opt} = Q_k$ is a solution and exit; otherwise,
3. Let $Q_{k+1} = Q_k + \sum_{b \in \Gamma(Q_k)} b$.
4. $k = k + 1$; go back to Step 2.
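The four steps can be sketched in plain Python for dense vectors. This is a minimal sketch under stated assumptions, not the authors' implementation: f is taken to be the inner product, every relevant document is treated as preferred over every non-relevant one, and the function names and toy data are hypothetical.

```python
from typing import List, Sequence, Tuple

def dot(x: Sequence[float], y: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(x, y))

def sda(relevant, non_relevant, q0, max_iter: int = 150) -> Tuple[List[float], bool]:
    """Return (query, converged); converged is True only if an empty set of
    violating difference vectors was observed within max_iter iterations."""
    q = list(q0)
    for _ in range(max_iter):
        # Step 2: difference vectors b = d - d' with d preferred to d' and q . b <= 0.
        gamma = []
        for d in relevant:
            for d2 in non_relevant:
                b = [x - y for x, y in zip(d, d2)]
                if dot(q, b) <= 0:
                    gamma.append(b)
        if not gamma:
            return q, True  # q ranks every relevant document above every non-relevant one
        # Step 3: add the sum of all violating difference vectors to the query.
        for b in gamma:
            q = [qi + bi for qi, bi in zip(q, b)]
    return q, False

# Toy, linearly separable data: relevant documents lean on the first term.
relevant = [[1.0, 0.0], [0.8, 0.2]]
non_relevant = [[0.0, 1.0], [0.2, 0.8]]
q_opt, converged = sda(relevant, non_relevant, q0=[0.0, 0.0])
```

Enumerating every relevant/non-relevant pair costs time proportional to the product of the two set sizes per iteration, which is one reason to cap the number of iterations and to subsample the negative examples.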

Overview of experiment (training with SDA)

Training

• All examples of the category are used as positive examples.
• A random 60% of the documents from other topics are used as negative examples.
• If the maximum Rnorm value (1) is not reached within the maximum of 150 iterations, the optimal query is set to the query that produced the highest Rnorm value found.
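Building the per-topic training set can be sketched as below. The input format (a list of document id and topic-set pairs), the function name, and the fixed random seed are assumptions made for illustration.

```python
import random

def build_training_set(docs, topic, neg_fraction=0.6, seed=0):
    """docs: list of (doc_id, set_of_topics). Returns (positives, negatives)."""
    positives = [doc_id for doc_id, topics in docs if topic in topics]
    others = [doc_id for doc_id, topics in docs if topic not in topics]
    rng = random.Random(seed)  # seeded so the sample is reproducible
    negatives = rng.sample(others, round(neg_fraction * len(others)))
    return positives, negatives

# Toy collection: 3 stories about "earn" and 10 about other topics.
docs = [(i, {"earn"}) for i in range(3)] + [(i, {"grain"}) for i in range(3, 13)]
positives, negatives = build_training_set(docs, "earn")
```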

Overview of experiment (topics)

There are 135 topics

Number of positive examples (#+) per topic

Topic      Train   Test
earn       2877    1087
acq        1650    719
money-fx   538     179
grain      433     149
crude      389     189
trade      369     118
interest   347     131
wheat      212     71
ship       197     89
corn       182     56

Overview of experiment (effectiveness measures)

• Create contingency tables
• Find breakeven points
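One simple way to locate a breakeven point is to cut the ranked list after as many documents as there are relevant ones: then a + b = a + c, so precision and recall coincide. The slides do not say how the authors located the point, so this is an assumed procedure; the function name and scores are hypothetical, and tied scores are not treated specially.

```python
def breakeven(scores, labels):
    """Precision (= recall) when the number of retrieved documents equals the number of relevant ones."""
    n_relevant = sum(labels)
    if n_relevant == 0:
        raise ValueError("breakeven is undefined without relevant documents")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    a = sum(is_relevant for _, is_relevant in ranked[:n_relevant])  # cell a of the table
    return a / n_relevant

scores = [0.9, 0.8, 0.7, 0.6, 0.5]          # hypothetical retrieval scores
labels = [True, False, True, True, False]   # True marks a relevant document
point = breakeven(scores, labels)
```

Here three documents are relevant and two of the top three are among them, so precision and recall both equal 2/3.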

Results

Breakeven points (%)

Topic        Findsim  Nbayes  SDA    Bnets  Trees  SVM
earn         92.9     95.9    96.32  95.8   97.8   98.0
acq          64.7     87.8    85.26  88.3   89.7   93.6
money-fx     46.7     56.6    68.72  58.8   66.2   74.5
grain        67.5     78.8    71.81  81.4   85.0   94.6
crude        70.1     79.5    82.54  79.6   85.0   88.9
trade        65.1     63.5    65.25  69.0   72.5   75.9
interest     63.4     64.9    61.07  71.3   67.1   77.7
wheat        68.9     69.7    76.06  82.7   92.5   91.9
ship         49.2     85.4    65.17  84.4   74.2   85.6
corn         48.2     65.3    75.00  76.4   91.8   90.3
Avg. Top 10  64.6     81.5    84.54  85.0   88.4   92.0
Avg. All     61.7     75.2    76.37  80.0   N/A    87.0


Thank you!