The Development of Sharing Publication Citation Information Website with Article Search System Using OKAPI BM25 Author Hartono (26405055) Supervisors Resmana Lim, M.Eng. Adi Wibowo, M.T.


Post on 14-Dec-2015


The Development of Sharing Publication Citation Information Website with Article Search System

Using OKAPI BM25

Author

Hartono (26405055)

Supervisors

Resmana Lim, M.Eng.

Adi Wibowo, M.T.

Background

• The need to obtain relevant scientific journal articles.
• Limited access to scientific journals.
• The need to collect article information not only by harvesting but also by manual entry.
• The need for better search results.

Problem & Goal

Problem:
• How can article information be harvested from external journal sites?
• How can articles in BibTeX, XML, or PDF format be entered into the database?
• How can articles be harvested automatically at set intervals?
• How can the articles in the database be indexed?
• How can the articles in the database be searched using OKAPI BM25?

Goal:
• To develop an information-sharing site that offers more complete article information and helps users find the information they want.

Context Diagram

[Figure: context diagram, spanning two slides; not included in the transcript]

Harvesting Process

Flowchart nodes: Start; Read the OAI request URL; Check metadata; Valid? (Y/N); Harvest metadata; Article database; Download source metadata; End.

OAI request examples:
• ListMetadataFormats verb: http://citeseerx.ist.psu.edu/oai2?verb=ListMetadataFormats
• ListIdentifiers verb: http://citeseerx.ist.psu.edu/oai2?verb=ListIdentifiers&from=2010-03-17&until=2010-03-18&metadataPrefix=oai_dc
• GetRecord verb: http://citeseerx.ist.psu.edu/oai2?verb=GetRecord&identifier=oai:CiteSeerXPSU:10.1.1.1.2918&metadataPrefix=oai_dc
• ListRecords verb: http://citeseerx.ist.psu.edu/oai2?verb=ListRecords&from=2010-03-17&until=2010-03-18&metadataPrefix=oai_dc
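The request URLs above all follow the same pattern: a base endpoint, a verb, and a few arguments. A minimal sketch of building them is shown here; the helper name `oai_request_url` and the `from_` keyword convention are assumptions made for illustration and do not come from the slides.

```python
from urllib.parse import urlencode

# Endpoint taken from the slide examples.
BASE_URL = "http://citeseerx.ist.psu.edu/oai2"


def oai_request_url(verb, **params):
    """Build an OAI request URL for a verb and its arguments (hypothetical helper)."""
    query = {"verb": verb}
    for key, value in params.items():
        # "from" is a Python keyword, so callers pass from_ and it is renamed here.
        query[key.rstrip("_")] = value
    return BASE_URL + "?" + urlencode(query)


print(oai_request_url("ListMetadataFormats"))
print(oai_request_url("ListRecords", from_="2010-03-17", until="2010-03-18",
                      metadataPrefix="oai_dc"))
```

Note that `urlencode` percent-encodes reserved characters, so an identifier such as `oai:CiteSeerXPSU:10.1.1.1.2918` would be sent with `%3A` in place of each colon.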

Article Management Process

Flowchart nodes: Start; Read articles from the database and from the user; Approve? (Y/N); Indexing; End.


Indexing Process

Title Process

1. Read the title.
2. Explode (title).
3. Stopword (title).
4. Stemming (title).
5. title_term = title_term + 1
6. Any terms left? If yes (Y), continue with the next term; if no (N), return.

Description Process

1. Read the description.
2. Explode (description).
3. Stopword (description).
4. Stemming (description).
5. description_term = description_term + 1
6. Any terms left? If yes (Y), continue with the next term; if no (N), return.

Content Process

1. Read the content.
2. Explode (content).
3. Stopword (content).
4. Stemming (content).
5. fullbody_term = fullbody_term + 1
6. Any terms left? If yes (Y), continue with the next term; if no (N), return.

Creator Process

1. Read the creator.
2. Explode (creator).
3. creator_term = creator_term + 1
4. Any terms left? If yes (Y), continue with the next term; if no (N), return.
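The counting loops in the Title, Description, Content, and Creator processes can be sketched as follows. This is only an illustration: it assumes each field has already been exploded, filtered, and stemmed into a list of terms, and the function name `count_field_terms` is hypothetical. The sample terms are borrowed from the Oai1 row of the article example in this deck.

```python
from collections import Counter, defaultdict


def count_field_terms(fields):
    """Return {term: Counter of per-field counts} from processed term lists."""
    doc_term = defaultdict(Counter)
    for field, terms in fields.items():
        for term in terms:
            # Mirrors "title_term = title_term + 1" and its siblings.
            doc_term[term][field + "_term"] += 1
    return doc_term


fields = {
    "title": ["complex", "stockhast"],
    "description": ["numer", "analysi"],
    "fullbody": ["model", "complex", "real", "detail", "analysi", "build"],
}
rows = count_field_terms(fields)
print(dict(rows["complex"]))
```

Each entry of `rows` corresponds roughly to one doc_term row for one article, holding the title_term, description_term, and fullbody_term counters for a single term.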

Explode Process

1. Read the input.
2. Remove punctuation.
3. Split the sentence into words.
4. Return.

Stop Word Process

1. Read the exploded input (term).
2. English stop word? If yes (Y), remove the English term.
3. Indonesian stop word? If yes (Y), remove the Indonesian term.
4. Output the terms without stop words.
5. Return.
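A minimal sketch of the explode and stop word steps is given below. The stop word sets are tiny placeholders (the real lists would come from the stop_word_eng and stop_word_indo tables, whose contents are not shown in the slides), and lowercasing the input is an added assumption.

```python
import re

# Placeholder stop word lists, for illustration only.
STOP_WORDS_ENG = {"the", "of", "and", "a", "in"}
STOP_WORDS_INDO = {"yang", "dan", "di", "dari"}


def explode(text):
    """Remove punctuation, then split the sentence into words."""
    # Lowercasing is an assumption; the slides do not mention it.
    cleaned = re.sub(r"[^\w\s]", " ", text.lower())
    return cleaned.split()


def remove_stopwords(terms):
    """Drop terms found in either the English or the Indonesian list."""
    return [t for t in terms
            if t not in STOP_WORDS_ENG and t not in STOP_WORDS_INDO]


terms = remove_stopwords(explode("The analysis of complex, real models."))
print(terms)  # ['analysis', 'complex', 'real', 'models']
```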

Stemming Process

Input: term without stop words.
Decisions: Irregular verb? (Y/N); Is the term in the English library (english_lib)? (Y/N).
Actions: English stemming; Indonesian stemming.
Output: stemmed term; return.

Calculate f(qi,D) Process

1. Read the term weights.
2. TF = (title * title weight) + (description * description weight) + fullbody_term - (title + description) * fullbody weight
3. Update total_term with TF.
4. Return.
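The weighted term frequency step can be sketched as below. Two assumptions are made here and are not confirmed by the slides: the last part of the formula is read as (fullbody_term - (title + description)) multiplied by the fullbody weight, on the guess that fullbody_term also counts the title and description occurrences, and the weight values in the example are invented.

```python
def weighted_tf(title, description, fullbody,
                w_title, w_description, w_fullbody):
    """Weighted f(qi,D) for one term in one article (assumed grouping)."""
    # Occurrences that appear only in the body, under the stated assumption.
    body_only = fullbody - (title + description)
    return (title * w_title
            + description * w_description
            + body_only * w_fullbody)


# Hypothetical weights: title counts three times, description twice.
print(weighted_tf(1, 0, 2, w_title=3.0, w_description=2.0, w_fullbody=1.0))  # 4.0
```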

Total Article Process

Calculate the total terms of an article:
1. Sum total_term in doc_term for the matching identifier.
2. Update total_term in article with the sum.
3. Return.

Calculate IDF Process

1. Count the number of articles (N).
2. Fetch every master_term_id from master_term.
3. Count the articles in doc_term that contain the master_term (n).
4. IDF = log10((N - n + 0.5) / (n + 0.5)) - log10(0.5 / (N + 0.5))
5. Update idf with IDF.
6. Any terms left? If yes (Y), continue with the next term; if no (N), return.
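The IDF step translates directly into a small function. The sketch below follows the slide formula; the function and argument names are illustrative. The second logarithm is the negative of the smallest value the first one can take (when n = N), so the result never drops below zero.

```python
import math


def idf(total_articles, articles_with_term):
    """IDF as written on the slide, shifted so that it is never negative."""
    N, n = total_articles, articles_with_term
    base = math.log10((N - n + 0.5) / (n + 0.5))
    shift = -math.log10(0.5 / (N + 0.5))
    return base + shift


# Four articles, term present in two of them.
print(round(idf(4, 2), 4))  # 0.9542
# Term present in every article: the shift cancels the base exactly.
print(idf(4, 4))            # 0.0
```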

Avgdl Process

Calculate avgdl:
1. Read every total_term in article.
2. Calculate the average total_term.
3. Update average_article with the average total term.
4. Return.

Search Process

Flowchart nodes: Start; Input keyword; Explode; Stopword; Stemming; Find all articles that contain the keyword; Found? (Y/N); Calculate Okapi; Sort results; End.

OKAPI Process

Calculate Okapi:
1. Fetch idf for the word that matches the search keyword.
2. Fetch total_term from doc_term (f(qi,D)).
3. Sum fullbody_term from doc_term (|D|).
4. TF = (f(qi,D) * (k1 + 1)) / (f(qi,D) + k1 * (1 - b + b * (|D| / avgdl)))
5. Return.

Parameters: k1 = 2, b = 0.75
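The Okapi calculation can be sketched as follows, using k1 = 2 and b = 0.75 from the slide. The slide only spells out the TF component; multiplying it by the IDF and summing over the query terms follows the standard BM25 definition and is an assumption about how the program combines them. All function names are illustrative.

```python
K1 = 2.0   # from the slide
B = 0.75   # from the slide


def bm25_tf(f, doc_len, avgdl, k1=K1, b=B):
    """Saturating, length-normalised TF component from the slide."""
    return (f * (k1 + 1)) / (f + k1 * (1 - b + b * (doc_len / avgdl)))


def bm25_score(query_terms, term_freqs, idf_by_term, doc_len, avgdl):
    """Sum IDF * TF over query terms present in the document (standard BM25)."""
    score = 0.0
    for term in query_terms:
        f = term_freqs.get(term, 0)
        if f:
            score += idf_by_term.get(term, 0.0) * bm25_tf(f, doc_len, avgdl)
    return score


# A document of average length with two occurrences of the term.
print(bm25_tf(2, doc_len=10, avgdl=10))  # 1.5
```

With a document exactly as long as the average, the length normalisation factor is 1, which makes the example easy to check by hand: (2 * 3) / (2 + 2 * 1) = 1.5.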

User Management Process

Flowchart nodes: Start; Read member data; Valid? (Y/N); Member file; End.

Message Management

Flowchart nodes: Start; Read message input; Valid? (Y/N); Message file; End.

Entity Relationship Diagram (ERD)

Relationship labels in the diagram: has (memiliki, mempunyai), writes (menulis), inputs (memasukkan). The transcript does not preserve which entities each relationship connects.

Entities and attributes:
• article: oai_identifier, datestamp, dc_title, dc_description, journal, editor, series, dc_publisher, volume, number, month, address, book_title, pages, dc_format, dc_type, dc_identifier, dc_language, dc_coverage, dc_rights, oai_id, published, approval, total_terms, category
• article_average: article_average
• category: category_name
• citation: oai_identifier, citate
• contributor: contributor_id, oai_identifier, contributor
• creator: creator_id, oai_identifier, creator_name
• date_article: date_id, oai_identifier, dc_date
• doc_term: oai_identifier, master_term_id, title_term, description_term, fullbody_term, creator_term, total_term
• english_lib: id, kata
• harvest_time: oai_id, date_from, date_until
• indexing_time: oai_identifier, time
• irreg_verb: id, kata_dasar, kata_bkn_dasar
• message: from, email, subject, message, message_status
• master_term: master_term_id, word, idf
• oai_request: oai_id, oai_url, oai_status, referfolder
• refrerence: reference_id, oai_identifier, relation
• source: oai_identifier, download_status, source
• stop_word_eng: id, kata
• stop_word_indo: id, kata
• subject: subject_id, oai_identifier, subject
• term: title_term, description_term, fullbody_term
• user: username, password, user_status, fullname, email, institution, profession, last_visit, join_date

OKAPI BM25

• Okapi BM25 is a ranking function used by search engines to rank documents according to their relevance to a given query.

OKAPI BM25 Formula

Inverse Document Frequency
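The formula images from this slide are not part of the transcript. For reference, the standard textbook form of the BM25 score and its IDF component is given below; this is the commonly published definition, not a copy of the slide, and the IDF actually computed in this system adds a constant shift so that it stays non-negative.

```latex
\[
\mathrm{score}(D, Q) = \sum_{i=1}^{n} \mathrm{IDF}(q_i)\,
\frac{f(q_i, D)\,(k_1 + 1)}
     {f(q_i, D) + k_1\left(1 - b + b\,\frac{|D|}{\mathrm{avgdl}}\right)}
\]
\[
\mathrm{IDF}(q_i) = \log \frac{N - n(q_i) + 0.5}{n(q_i) + 0.5}
\]
```

Here f(q_i, D) is the frequency of query term q_i in document D, |D| is the document length, avgdl is the average document length, N is the number of documents, and n(q_i) is the number of documents containing q_i.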

Article Example

Article example (terms shown after stemming):

Article | Title | Description | Content
Oai1 | complex stockhast | Numer analysi | Model complex real detail analysi build
Oai2 | Managed abstrach build | Manner detail | Join creation numer make possibl
Oai3 | Structur detail possibl | Real abstrach world | Make detail usual manner
Oai4 | Build world explor | Analysi detail | Managed stockhast replicating complex explor

Manual & Program IDF Calculation

Manual: [figure not included in the transcript]
Program: [figure not included in the transcript]

Manual & Program OKAPI Calculation

Keyword example: complex
Manual: [figure not included in the transcript]
Program: [figure not included in the transcript]

Recall Precision

Articles: 500
Keyword: Network System
Search results: 198 articles
Possibly relevant results: 29 articles
Relevant results: 12 articles
Recall = 12/12 * 100% = 100%
Precision = 12/198 * 100% = 6%

OAI identifier | Relevant | Search rank
oai:CiteSeerXPSU:10.1.1.1.3301 | no | 15
oai:CiteSeerXPSU:10.1.1.1.8714 | no | 12
oai:CiteSeerXPSU:10.1.1.11.3246 | yes | 8
oai:CiteSeerXPSU:10.1.1.131.2961 | no | 6
oai:CiteSeerXPSU:10.1.1.133.114 | yes | 3
oai:CiteSeerXPSU:10.1.1.133.5166 | no | 16
oai:CiteSeerXPSU:10.1.1.134.7415 | no | 25
oai:CiteSeerXPSU:10.1.1.135.7151 | no | 13
oai:CiteSeerXPSU:10.1.1.138.8592 | yes | 5
oai:CiteSeerXPSU:10.1.1.143.7835 | yes | 24
oai:CiteSeerXPSU:10.1.1.143.9199 | no | 28
oai:CiteSeerXPSU:10.1.1.147.3140 | yes | 9
oai:CiteSeerXPSU:10.1.1.148.6013 | yes | 10
oai:CiteSeerXPSU:10.1.1.149.7229 | no | 18
oai:CiteSeerXPSU:10.1.1.2.8672 | no | 29
oai:CiteSeerXPSU:10.1.1.2.876 | yes | 4
oai:CiteSeerXPSU:10.1.1.28.2069 | no | 21
oai:CiteSeerXPSU:10.1.1.28.3751 | no | 23
oai:CiteSeerXPSU:10.1.1.31.5233 | yes | 17
oai:CiteSeerXPSU:10.1.1.32.3394 | no | 19
oai:CiteSeerXPSU:10.1.1.34.422 | yes | 20
oai:CiteSeerXPSU:10.1.1.37.133 | no | 26
oai:CiteSeerXPSU:10.1.1.37.886 | no | 27
oai:CiteSeerXPSU:10.1.1.46.7941 | yes | 1
oai:CiteSeerXPSU:10.1.1.5.5436 | yes | 2
oai:CiteSeerXPSU:10.1.1.61.8860 | no | 22
oai:CiteSeerXPSU:10.1.1.62.5142 | no | 14
oai:CiteSeerXPSU:10.1.1.8.4971 | no | 11
oai:CiteSeerXPSU:10.1.1.94.3465 | yes | 7
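The recall and precision figures for the "Network System" query can be reproduced with a few lines. The sketch below uses the counts from the slide; the function names are illustrative. It follows the slide in taking the number of relevant articles found (12) as the total number of relevant articles, which is why recall comes out at 100%.

```python
def precision(relevant_retrieved, retrieved):
    """Share of the retrieved articles that are relevant."""
    return relevant_retrieved / retrieved


def recall(relevant_retrieved, relevant_total):
    """Share of all relevant articles that were retrieved."""
    return relevant_retrieved / relevant_total


# Counts from the slide: 198 results, 12 of them judged relevant.
print(round(precision(12, 198) * 100, 1))  # 6.1
print(recall(12, 12) * 100)                # 100.0
```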

Recall Precision (continued)

Keyword: music model
Search results: 150 articles
Possibly relevant results: 30 articles
Relevant results: 14 articles
Recall = 14/14 * 100% = 100%
Precision = 14/150 * 100% = 9.3%

OAI identifier | Relevant | Search rank
oai:CiteSeerXPSU:10.1.1.10.1860 | yes | 19
oai:CiteSeerXPSU:10.1.1.10.2860 | no | 29
oai:CiteSeerXPSU:10.1.1.111.3072 | yes | 18
oai:CiteSeerXPSU:10.1.1.127.8691 | yes | 21
oai:CiteSeerXPSU:10.1.1.130.1856 | yes | 6
oai:CiteSeerXPSU:10.1.1.133.7089 | no | 27
oai:CiteSeerXPSU:10.1.1.140.3374 | no | 10
oai:CiteSeerXPSU:10.1.1.140.8940 | yes | 25
oai:CiteSeerXPSU:10.1.1.142.7598 | no | 12
oai:CiteSeerXPSU:10.1.1.149.6567 | yes | 30
oai:CiteSeerXPSU:10.1.1.152.2688 | yes | 11
oai:CiteSeerXPSU:10.1.1.154.24 | no | 16
oai:CiteSeerXPSU:10.1.1.154.2529 | yes | 20
oai:CiteSeerXPSU:10.1.1.155.1750 | no | 33
oai:CiteSeerXPSU:10.1.1.16.7401 | no | 32
oai:CiteSeerXPSU:10.1.1.17.1013 | yes | 1
oai:CiteSeerXPSU:10.1.1.18.6229 | no | 13
oai:CiteSeerXPSU:10.1.1.2.6849 | no | 31
oai:CiteSeerXPSU:10.1.1.2.8672 | no | 8
oai:CiteSeerXPSU:10.1.1.20.3633 | yes | 15
oai:CiteSeerXPSU:10.1.1.31.5233 | yes | 7
oai:CiteSeerXPSU:10.1.1.32.5049 | no | 24
oai:CiteSeerXPSU:10.1.1.34.7828 | yes | 4
oai:CiteSeerXPSU:10.1.1.4.677 | yes | 5
oai:CiteSeerXPSU:10.1.1.4.7323 | yes | 3
oai:CiteSeerXPSU:10.1.1.5.1181 | no | 23
oai:CiteSeerXPSU:10.1.1.5.4681 | no | 17
oai:CiteSeerXPSU:10.1.1.52.4788 | no | 28
oai:CiteSeerXPSU:10.1.1.57.3576 | no | 14
oai:CiteSeerXPSU:10.1.1.59.9118 | no | 9

Recall Precision (continued)

Keyword: music analysis
Search results: 116 articles
Possibly relevant results: 23 articles
Relevant results: 10 articles
Recall = 10/10 * 100% = 100%
Precision = 10/116 * 100% = 8.6%

OAI identifier | Relevant | Search rank
oai:CiteSeerXPSU:10.1.1.10.2860 | yes | 22
oai:CiteSeerXPSU:10.1.1.10.3132 | yes | 2
oai:CiteSeerXPSU:10.1.1.140.3374 | no | 3
oai:CiteSeerXPSU:10.1.1.140.8940 | no | 9
oai:CiteSeerXPSU:10.1.1.145.8953 | yes | 5
oai:CiteSeerXPSU:10.1.1.149.6567 | no | 23
oai:CiteSeerXPSU:10.1.1.154.2529 | yes | 19
oai:CiteSeerXPSU:10.1.1.155.1750 | yes | 17
oai:CiteSeerXPSU:10.1.1.155.4454 | yes | 10
oai:CiteSeerXPSU:10.1.1.156.2520 | yes | 20
oai:CiteSeerXPSU:10.1.1.18.6229 | no | 13
oai:CiteSeerXPSU:10.1.1.2.6849 | no | 21
oai:CiteSeerXPSU:10.1.1.2.8672 | yes | 1
oai:CiteSeerXPSU:10.1.1.25.747 | no | 18
oai:CiteSeerXPSU:10.1.1.29.4192 | no | 11
oai:CiteSeerXPSU:10.1.1.34.7828 | yes | 7
oai:CiteSeerXPSU:10.1.1.4.7323 | no | 4
oai:CiteSeerXPSU:10.1.1.5.1181 | no | 16
oai:CiteSeerXPSU:10.1.1.5.4681 | yes | 15
oai:CiteSeerXPSU:10.1.1.52.4788 | no | 12
oai:CiteSeerXPSU:10.1.1.59.9118 | no | 6
oai:CiteSeerXPSU:10.1.1.6.3984 | no | 14
oai:CiteSeerXPSU:10.1.1.6.757 | no | 8

Indexing Time

Articles: 500

Number of articles | Time required (seconds)
100 | 805.1392138
200 | 1646.911684
300 | 2509.824728
400 | 3514.183314
500 | 4744.517922
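Dividing each measurement by its article count shows how the cost per article changes as the collection grows. The short sketch below does that arithmetic on the figures from the table; the variable names are illustrative.

```python
# Indexing times from the table: number of articles -> seconds.
indexing_seconds = {
    100: 805.1392138,
    200: 1646.911684,
    300: 2509.824728,
    400: 3514.183314,
    500: 4744.517922,
}

# Average seconds spent per article at each collection size.
per_article = {n: s / n for n, s in indexing_seconds.items()}

for n, seconds in per_article.items():
    print(n, round(seconds, 2))
```

The per-article time rises from about 8.05 seconds at 100 articles to about 9.49 seconds at 500, so in these measurements indexing time grows slightly faster than linearly with the number of articles.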


Search Time

Articles: 500

Keyword | Search results (articles) | Time (seconds)
computer analysis | 140 | 0.549877882004
user applications | 92 | 0.547022104263
work scheme | 92 | 0.491093873978
high image transform | 101 | 0.498678922653
network | 76 | 0.270733833313

Conclusion

1. The system can only harvest metadata in the oai_dc metadata format.
2. The system can only update automatically from approved URLs.
3. The time the system needs to return keyword-related articles varies with the number of articles returned.
4. Recall on the search results is very good, averaging 100%, while precision is poor, averaging less than 10%. The result is still good enough because all articles that may be relevant appear within roughly the first 30 ranks.

Suggestion

1. The system can be developed further so that it also acts as a data provider.
2. The system can be extended to harvest other metadata formats dynamically.


Thank You For Your Attention