BioKnOT: Biological Knowledge through Ontology and TFIDF. By: James Costello. Advisor: Mehmet Dalkilic


Page 1: BioKnOT Biological Knowledge through Ontology and TFIDF By: James Costello Advisor: Mehmet Dalkilic

BioKnOT

Biological Knowledge through Ontology and TFIDF

By: James Costello

Advisor: Mehmet Dalkilic

Page 2: BioKnOT Biological Knowledge through Ontology and TFIDF By: James Costello Advisor: Mehmet Dalkilic

June 11, 2004 | Bioinformatics Capstone Project | Costello

Outline

- Motivation and Goals
- Background
- Program Architecture
- Populating the Article Database
- Developing an Article Scoring Model
- BioKnOT Demonstration
- Summary and Future Work

Page 3: BioKnOT Biological Knowledge through Ontology and TFIDF By: James Costello Advisor: Mehmet Dalkilic

Motivation and Goals

Motivation
- Current online text searching methods are not good enough for highly specific research.
- Importance
- Timeliness
- Relevance

Goal of Project
- Create an online text retrieval system that lets users construct their own set of highly specific, timely, and important research articles, custom fit to each user's needs.

Page 4: BioKnOT Biological Knowledge through Ontology and TFIDF By: James Costello Advisor: Mehmet Dalkilic

Standard Search Model

- $D$ = set of documents
- $D'$ = set of documents that meet some search criteria, with $D' \subseteq D$
- $D' = \{d_1, d_2, \ldots, d_k\}$, where $d_i$ is an individual document and we hope $d_i$ is more interesting than $d_{i+1}$
- $|D'|$ = huge number of documents
- $|D'|$ for a filtered search on PubMed for "apoptosis" is 65,832 articles

Page 5: BioKnOT Biological Knowledge through Ontology and TFIDF By: James Costello Advisor: Mehmet Dalkilic

BioKnOT Search Model

- $D$ = set of documents
- $D'$ = set of documents that meet the initial search criteria, with $D' \subseteq D$
- $D'_t$ = set of documents that pass the filter, with $D'_t \subseteq D'$
- $D'_{tu}$ = set of documents that have been ranked based on semantic content from user input, with $D'_{tu} \subseteq D'_t$
- $D'_{tu} = \{d_1, d_2, \ldots, d_k\}$, where $d_i$ is an individual document and $d_i$ is more interesting than $d_{i+1}$
- $|D'_{tu}|$ = very small and very specific

Page 6: BioKnOT Biological Knowledge through Ontology and TFIDF By: James Costello Advisor: Mehmet Dalkilic

Program Architecture

[Figure: diagram of the BioKnOT pages, connected by hyperlinks]
- Initial Search Page: Boolean Search
- Filter Page: Filter Your Search (a list of terms, e.g. "apoptosis")
- Word Weighting Page: Add Word Weights (each term rated on a Bad to Good scale)
- User Input Page: Submit Description (user's sentences)
- Results Page: Refine Your Search; each result lists the article title with "View Word Graph" and "See All Data" links
- Hyperlinks from the Results Page lead to the actual online article, to all stored data on the article (title, author(s), …), and to an illustration of word relationships in the article

Page 7: BioKnOT Biological Knowledge through Ontology and TFIDF By: James Costello Advisor: Mehmet Dalkilic

Populating the Article Database

Data we need
- Author(s)
- Article title
- Abstract
- Journal title
- Date and year of publication
- Count of how many times the article was cited
- URL of online full text article or PubMed search results
- Some type of accession number

Page 8: BioKnOT Biological Knowledge through Ontology and TFIDF By: James Costello Advisor: Mehmet Dalkilic

Resources Used in Populating the Database

- Institute of Scientific Information (ISI) Web of Science: http://bert.lib.indiana.edu:2182/portal.cgi
- EndNote 7
- PubMed: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi

Page 9: BioKnOT Biological Knowledge through Ontology and TFIDF By: James Costello Advisor: Mehmet Dalkilic

Steps Taken to Populate the Article Database

[Figure: flow diagram of the database population pipeline]
- ISI's Web of Science search interface: article information is exported to EndNote 7.
- EndNote 7: the records are exported as XML and parsed into the Article Database (> 2,000).
- A web bot uses the PubMed search interface to look for URL information using the article title and author(s).
- After the PubMed abstract is found (PubMed article abstract interface), the web bot searches for the online article URL.
- Either the PubMed URL or the online article URL is inserted into the Article Database.

Page 10: BioKnOT Biological Knowledge through Ontology and TFIDF By: James Costello Advisor: Mehmet Dalkilic

Initial Search

Boolean search
- Searches all articles in the database that have a URL
- Searches each article's title and abstract

Page 11: BioKnOT Biological Knowledge through Ontology and TFIDF By: James Costello Advisor: Mehmet Dalkilic

Filter Page: TFIDF

LUCAS Web Service: http://lair.indiana.edu/research/lucas/index.html

TFIDF Calculations
- TF = number of occurrences of a term in a document
- IDF = log of the total number of documents over the number of documents that contain the desired term

$$tf_{i,d} = \frac{|d_i|}{\left|\sum_{i}^{k} d_i\right|}$$

$$idf_{i,D} = \log_2 \frac{|D|}{|\{d_i \mid d_i \in D\}|}$$

$$tfidf_{i,d} = (1 + tf_{i,d}) \, idf_{i,D} \quad \text{if } tf_{i,d} \geq 1$$
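The TFIDF weighting on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: TF is taken as the raw occurrence count (the slide's verbal definition), documents are pre-tokenized word lists, and the function name `tfidf` and the sample documents are hypothetical rather than part of BioKnOT or the LUCAS service.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw occurrence counts (assumed reading of TF)
        weights.append({
            # (1 + tf) * idf; every counted term has tf >= 1 by construction.
            term: (1 + count) * math.log2(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights

# Hypothetical toy corpus, for illustration only.
docs = [
    "apoptosis induces cell death".split(),
    "cell growth in human tissue".split(),
    "human cell death".split(),
]
w = tfidf(docs)
```

A term that appears in every document, such as "cell" here, gets an IDF of zero and therefore no weight, which is the behaviour that makes TFIDF useful for picking filter terms.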

Page 12: BioKnOT Biological Knowledge through Ontology and TFIDF By: James Costello Advisor: Mehmet Dalkilic

Term Relationship Measurements

- Intra-sentence distance: sentence structure taken into account
- Inter-sentence distance: sentence structure ignored

Ex.
"... and is not present in the mitochondria. Permeability is another..."
"... mitochondrial permeability is an important aspect of apoptosis..."
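The difference between the two measurements can be illustrated with a small Python sketch. The function `term_distance`, its parameters, and the simple punctuation-based sentence splitting are assumptions made for illustration; the slides do not describe how BioKnOT tokenizes text.

```python
import re

def term_distance(text, a, b, respect_sentences=True):
    """Smallest word distance from `a` to a later `b`, or None if not found."""
    # Intra-sentence mode searches each sentence separately;
    # inter-sentence mode treats the whole text as one word sequence.
    chunks = re.split(r"[.!?]", text) if respect_sentences else [text]
    best = None
    for chunk in chunks:
        words = re.findall(r"[a-z]+", chunk.lower())
        for i, word in enumerate(words):
            if word != a:
                continue
            for j in range(i + 1, len(words)):
                if words[j] == b:
                    d = j - i
                    best = d if best is None else min(best, d)
                    break
    return best

first = "and is not present in the mitochondria. Permeability is another"
intra = term_distance(first, "mitochondria", "permeability", respect_sentences=True)
inter = term_distance(first, "mitochondria", "permeability", respect_sentences=False)
```

For the first example sentence pair, the intra-sentence search finds no relationship because the two words fall in different sentences, while the inter-sentence search reports them as adjacent.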

Page 13: BioKnOT Biological Knowledge through Ontology and TFIDF By: James Costello Advisor: Mehmet Dalkilic

Inter-sentence vs. Intra-sentence Distance

[Figure: searching for the relationship "cell death" across five documents]
- Doc A: …cell…
- Doc B: …cell death…
- Doc C: …cell. Death…
- Doc D: …death…
- Doc E: …cell death…
- Diagram labels: initial search set of documents; documents used to construct the random model; documents that are scored and returned to the user

Page 14: BioKnOT Biological Knowledge through Ontology and TFIDF By: James Costello Advisor: Mehmet Dalkilic

Visual Representation of Term Relationships

[Figure: two term relationship graphs]
- Graph M: example of a term relationship graph that was specified by the user
- Graph N: example of a term relationship graph that was taken from an article's abstract

Page 15: BioKnOT Biological Knowledge through Ontology and TFIDF By: James Costello Advisor: Mehmet Dalkilic

Scoring an Article

- $M$ = user-defined term relationships
- $N$ = term relationships in the abstract of an individual article
- $S$ = scoring matrix
- $P$ = presence or absence of a term relationship from $M$ in $N$
- $f$ = sigmoidal term relationship function

$$\text{Abstract Score} = \sum P_{M,N}(i,j) \times S_{i,j} \times f_{M_{i,j}}(N_{i,j})$$

$$P_{M,N}(i,j) = \begin{cases} 1 & \text{if } M_{i,j} \times N_{i,j} \neq 0 \\ -1 & \text{otherwise} \end{cases}$$
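A minimal Python sketch of the abstract score follows. It assumes that M, N, and S are equally sized square matrices over the same term list and that the sum runs over every pair (i, j); the slide does not state the index range. The membership function is passed in as a parameter, and `step_membership` is a hypothetical stand-in, not the sigmoidal function used by BioKnOT.

```python
def presence(m_ij, n_ij):
    """P_{M,N}(i,j): 1 if the relationship is present in both M and N, else -1."""
    return 1 if m_ij * n_ij != 0 else -1

def abstract_score(M, N, S, f):
    """Sum P * S * f over every term pair (i, j)."""
    size = len(M)
    return sum(
        presence(M[i][j], N[i][j]) * S[i][j] * f(M[i][j], N[i][j])
        for i in range(size)
        for j in range(size)
    )

def step_membership(m_ij, n_ij):
    """Hypothetical stand-in for f: full membership when the article's
    distance does not exceed the user's distance, half membership otherwise."""
    return 1.0 if n_ij <= m_ij else 0.5

# Hypothetical 2-term example: one relationship matched, one missing from N.
M = [[0, 5], [2, 0]]
N = [[0, 3], [0, 0]]
S = [[0, 2.0], [1.0, 0]]
score = abstract_score(M, N, S, step_membership)
```

In this toy case the matched relationship contributes +2.0 and the missing one contributes -1.0, so relationships the user asked for but the abstract lacks pull the score down.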

Page 16: BioKnOT Biological Knowledge through Ontology and TFIDF By: James Costello Advisor: Mehmet Dalkilic

Sigmoidal Scoring Function

$$f_{M_{i,j}}(N_{i,j}) = \begin{cases} 1 & \text{if } x \leq \alpha \\ 1 - \frac{1}{2} \cdot \frac{x - \alpha}{\beta - \alpha} & \text{if } \alpha < x \leq \beta \\ \frac{1}{2} \cdot \frac{x - \alpha}{\beta - \alpha} & \text{if } \beta < x \leq \gamma \\ 0 & \text{if } x > \gamma \end{cases}$$

[Figure: plot of % Term Membership (y-axis, marked at 0, ½, and 1) against Term Distance (x-axis, marked at α, β, and γ)]

Page 17: BioKnOT Biological Knowledge through Ontology and TFIDF By: James Costello Advisor: Mehmet Dalkilic

Scoring Matrix (Random Model)

- Derived from the TFIDF terms that were defined by the user and the abstracts of all the articles returned by the initial term search.
- User-defined term relationships are found in all the abstracts and the log-odds score is taken.
- $P(t_j \mid t_i, \Delta)$ is found by first finding a word, $t_i$, that the user has defined and then opening a 5-word reading frame, $\Delta$, following $t_i$. A second user-defined word, $t_j$, must be present within $\Delta$.

$$\text{LOD Score}(t_i, t_j) = \log_2 \frac{P(t_j \mid t_i, \Delta)}{P(t_i) \times P(t_j)}$$
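The log-odds score can be written as a one-line Python function. The name `lod_score` and the `base` parameter are illustrative assumptions; the formula on this slide uses base 2, and the base is left adjustable only so that other conventions can be compared.

```python
import math

def lod_score(p_joint, p_i, p_j, base=2):
    """log_base( P(t_j | t_i, frame) / (P(t_i) * P(t_j)) )."""
    return math.log(p_joint / (p_i * p_j), base)

# Hypothetical probabilities: the pair co-occurs 8 times more often than chance.
example = lod_score(0.5, 0.25, 0.25)
```

A positive score means the two terms co-occur within the reading frame more often than their individual frequencies would predict, and a negative score means less often. With the worked values given later in the deck (0.167, 0.03, 0.06), base 10 yields about 1.97, which is the figure reported there.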

Page 18: BioKnOT Biological Knowledge through Ontology and TFIDF By: James Costello Advisor: Mehmet Dalkilic

Steps to derive the Scoring Matrix

- Determine important terms: cell, death, human.
- Look for relationships of those words in the search space.
- Relationships: cell→death, cell→human, death→cell, death→human, human→cell, human→death
- Search space (abstract, 20 words): "The effects … cell in a human … in cancer."
- Once an important term is found, a 5-word reading frame is opened. If a relationship is found within the reading frame, the distance between the words is taken: cell→human = 3.
- If multiple occurrences of the same relationship are found in the search space, the average is taken.
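The reading-frame steps above can be sketched as follows. The function `frame_distances` and the sample sentence are hypothetical; the slide elides most of its abstract, so the filler words here are invented purely so that "cell" and "human" sit three words apart as in the slide's example.

```python
from collections import defaultdict

def frame_distances(words, terms, frame=5):
    """Average distance for each ordered pair (t_i, t_j) found within the frame."""
    found = defaultdict(list)
    for i, word in enumerate(words):
        if word not in terms:
            continue
        # Open a reading frame of `frame` words following t_i.
        for d in range(1, frame + 1):
            if i + d >= len(words):
                break
            other = words[i + d]
            if other in terms and other != word:
                found[(word, other)].append(d)
    # Multiple occurrences of the same relationship are averaged.
    return {pair: sum(ds) / len(ds) for pair, ds in found.items()}

# Invented filler text around the slide's "cell in a human" fragment.
abstract = "the effects on a cell in a human were studied in cancer".split()
distances = frame_distances(abstract, {"cell", "death", "human"})
print(distances)  # {('cell', 'human'): 3.0}
```

Only the ordered pair cell→human is reported, because no other important term falls within five words after "cell" or after "human" in this text.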

Page 19: BioKnOT Biological Knowledge through Ontology and TFIDF By: James Costello Advisor: Mehmet Dalkilic

Steps to derive the Scoring Matrix

Lastly, these relationships, along with the individual word probabilities, can be taken, scored, and structured into a matrix.

- P(cell→human) = 2/12 = .167
- P(cell) = .03
- P(human) = .06
- LOD(cell→human) = 1.97 (this value corresponds to a base-10 logarithm of .167 / (.03 × .06))
- Continue for all relationships

            apoptosis   human   cell
apoptosis   0           1.27    -1.08
human       1.64        0       0
cell        2.35        1.97    0

Page 20: BioKnOT Biological Knowledge through Ontology and TFIDF By: James Costello Advisor: Mehmet Dalkilic

Adding User Weights to Term Matrix

- The user is asked to enter a weight for each word relationship found within the user's expansion statement.
- Weights lie in the range [0, 2].
- The weight is denoted $r_{i,j}$ for term $i$ to term $j$.
- Weights are multiplied by the matrix values to add the user's input into the random model.

Page 21: BioKnOT Biological Knowledge through Ontology and TFIDF By: James Costello Advisor: Mehmet Dalkilic

Scoring Matrix Before User's Word Weights

S_{i,j}    cell     death    protein
cell       0.0      2.54     0.0
death      0.98     0.0      0.0
protein    -1.65    3.65     0.0

User's Word Weight Submissions

cell → death       2.0
death → cell       1.0
protein → cell     0.5
protein → death    1.5

Scoring Matrix After User's Word Weights

S_{i,j}    cell     death    protein
cell       0.0      5.08     0.0
death      0.98     0.0      0.0
protein    -3.30    5.48     0.0

$$\text{Final Score } S_{i,j} = \begin{cases} 0 & \text{if } S_{i,j} = 0 \\ r_{i,j} \times S_{i,j} & \text{if } S_{i,j} > 0 \\ S_{i,j} \times \frac{1}{r_{i,j}} & \text{if } S_{i,j} < 0 \end{cases}$$
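The weighting rule can be checked against the slide's own numbers with a short Python sketch. The function name `apply_weight` is hypothetical, and raising an error for a zero weight on a negative score is an assumption: the slide allows weights in [0, 2] but does not say what happens when the rule would divide by zero.

```python
def apply_weight(s_ij, r_ij):
    """Final score for one matrix entry given a user weight r_ij in [0, 2]."""
    if s_ij == 0:
        return 0.0
    if s_ij > 0:
        return r_ij * s_ij
    if r_ij == 0:
        # Assumption: the slide does not define S / r for r = 0.
        raise ValueError("weight 0 is undefined for a negative score")
    return s_ij / r_ij

# Values taken from the before-weighting matrix and the user's submissions.
S = {
    ("cell", "death"): 2.54,
    ("death", "cell"): 0.98,
    ("protein", "cell"): -1.65,
    ("protein", "death"): 3.65,
}
r = {
    ("cell", "death"): 2.0,
    ("death", "cell"): 1.0,
    ("protein", "cell"): 0.5,
    ("protein", "death"): 1.5,
}
final = {pair: apply_weight(S[pair], r[pair]) for pair in S}
```

Dividing negative entries by the weight, rather than multiplying, means a low weight makes an unfavourable relationship count more heavily against an article, as in protein→cell going from -1.65 to -3.30.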

Page 22: BioKnOT Biological Knowledge through Ontology and TFIDF By: James Costello Advisor: Mehmet Dalkilic

Visual Representation of Term Relationships

[Figure: two term relationship graphs]
- Graph M: example of a term relationship graph that was specified by the user
- Graph N: example of a term relationship graph that was taken from an article's abstract

Page 23: BioKnOT Biological Knowledge through Ontology and TFIDF By: James Costello Advisor: Mehmet Dalkilic

Comparing Term Relationship Graphs

In order to compare the word graphs, an adjacency matrix must be created for each graph. The values of $M_{i,j}$ and $N_{i,j}$ are taken from these matrices.

Matrix M

            apoptosis   tumor
apoptosis   0           5.00
tumor       0           0

Matrix N

          fas    induce
fas       0      3.00
induce    0      0
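Building such an adjacency matrix from a graph's edges can be sketched in Python. The function `adjacency_matrix` and the edge-dictionary representation are assumptions for illustration; the entries are treated here simply as the edge values shown on the slide.

```python
def adjacency_matrix(terms, edges):
    """Square matrix whose entry [i][j] is the value of edge terms[i] -> terms[j], else 0."""
    index = {term: k for k, term in enumerate(terms)}
    matrix = [[0.0] * len(terms) for _ in terms]
    for (src, dst), value in edges.items():
        matrix[index[src]][index[dst]] = value
    return matrix

# The two single-edge graphs shown on this slide.
M = adjacency_matrix(["apoptosis", "tumor"], {("apoptosis", "tumor"): 5.00})
N = adjacency_matrix(["fas", "induce"], {("fas", "induce"): 3.00})
```

Because the relationships are directed, the matrices are not symmetric: apoptosis→tumor has a value while tumor→apoptosis stays at zero.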

Page 24: BioKnOT Biological Knowledge through Ontology and TFIDF By: James Costello Advisor: Mehmet Dalkilic

Results and Refinement

- Support Score in the form of Citation Frequency: the citation count supplied by ISI's Web of Science divided by the number of years between the publication date and now.
- Semantic Score from the equation

$$\sum P_{M,N}(i,j) \times S_{i,j} \times f_{M_{i,j}}(N_{i,j})$$
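The support score is simple enough to show directly. In this Python sketch the function name `citation_frequency` is hypothetical, and returning the raw citation count for an article published in the current year is an assumption, since the slide does not say how a zero-year difference is handled.

```python
def citation_frequency(citation_count, publication_year, current_year):
    """Citations per year elapsed since publication."""
    years = current_year - publication_year
    if years <= 0:
        # Assumption: same-year articles keep their raw citation count.
        return float(citation_count)
    return citation_count / years

# Hypothetical article: 40 citations, published in 2000, evaluated in 2004.
support = citation_frequency(40, 2000, 2004)
```

Normalizing by age keeps older articles from dominating the ranking simply because they have had longer to accumulate citations.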

Page 25: BioKnOT Biological Knowledge through Ontology and TFIDF By: James Costello Advisor: Mehmet Dalkilic

Software Demonstration

- BioKnOT: http://biokdd.informatics.indiana.edu/cgi-bin/jccostel/thesis/bioknot.cgi
- PubMed: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=PubMed

Page 26: BioKnOT Biological Knowledge through Ontology and TFIDF By: James Costello Advisor: Mehmet Dalkilic

Summary

- BioKnOT offers a new and effective way to search research articles.
- BioKnOT offers many features that aid the user in deciding what factors are important in retrieving articles.
- Currently under submission to the SIGIR Bioinformatics workshop.

Page 27: BioKnOT Biological Knowledge through Ontology and TFIDF By: James Costello Advisor: Mehmet Dalkilic

Future Work

- Add more sophisticated support through citation frequency.
- Increase the efficiency of the scoring method.
- Perform a usability analysis.
- Incorporate BioKnOT into CATPA.
- Develop a Bioinformatics Knowledge Base locally using BioKnOT.

Page 28: BioKnOT Biological Knowledge through Ontology and TFIDF By: James Costello Advisor: Mehmet Dalkilic

Acknowledgments

- Professor Mehmet Dalkilic
- Professor Javed Mostafa
- Professor Sun Kim