BioKnOT
Biological Knowledge through Ontology and TFIDF
By: James Costello
Advisor: Mehmet Dalkilic
June 11, 2004 | Bioinformatics Capstone Project | Costello
Outline
  Motivation and Goals
  Background
  Program Architecture
  Populating the Article Database
  Developing an Article Scoring Model
  BioKnOT Demonstration
  Summary and Future Work
Motivation and Goals
Motivation
  Current online text searching methods are not good enough for highly specific research.
    Importance
    Timeliness
    Relevance
Goal of Project
  Create an online text retrieval system that will allow users to construct their own set of highly specific, timely, and important research articles that are custom fit to a user's needs.
Standard Search Model
$D$ = set of documents
$D'$ = set of documents that meet some search criteria, $D' \subseteq D$
$D' = \{d_1, d_2, \ldots, d_k\}$, where $d_i$ is an individual document and we hope $d_i$ is more interesting than $d_{i+1}$
$|D'|$ = huge number of documents
$|D'|$ for a filtered search on PubMed for “apoptosis” is 65,832 articles
BioKnOT Search Model
$D$ = set of documents
$D'$ = set of documents that meet the initial search criteria, $D' \subseteq D$
$D'_t$ = set of documents that pass the filter, $D'_t \subseteq D'$
$D'_{tu}$ = set of documents that have been ranked based on semantic content from user input, $D'_{tu} \subseteq D'_t$
$D'_{tu} = \{d_1, d_2, \ldots, d_k\}$, where $d_i$ is an individual document and $d_i$ is more interesting than $d_{i+1}$
$|D'_{tu}|$ = very small and very specific
Program Architecture
[Diagram of the BioKnOT pages]
  Initial Search Page: Boolean Search
  Filter Page: Filter Your Search (list of terms, e.g. apoptosis)
  User Input Page: Submit Description (user's sentences)
  Results Page: Refine Your Search; numbered article titles, each with "View Word Graph" and "See All Data"
  Word Weighting Page: Add Word Weights (each term rated on a scale from Bad to Good)
  Hyperlinks from the Results Page lead to:
    Actual online article
    All stored data on the article (title, author(s), …)
    Illustration of word relationships in the article
Populating the Article Database
Data we need
  Author(s)
  Article title
  Abstract
  Journal title
  Date and year of publication
  Count of how many times the article was cited
  URL of online full text article or PubMed search results
  Some type of accession number
Resources Used in Populating the Database
  Institute for Scientific Information (ISI) Web of Science: http://bert.lib.indiana.edu:2182/portal.cgi
  EndNote 7
  PubMed: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi
Steps Taken to Populate the Article Database
[Flow diagram]
  ISI's Web of Science search interface: export article information to EndNote 7
  EndNote 7: export XML and parse into the Article Database (> 2,000)
  PubMed search interface: web bot searches for URL information using article title and author(s)
  PubMed article abstract interface: after the PubMed abstract is found, the web bot searches for the online article URL
  Either the PubMed URL or the online article URL is inserted into the Article Database
Initial Search
Boolean search
  Searches all articles in the database with a URL
  Searches an article's title and abstract
Filter Page: TFIDF
LUCAS Web Service: http://lair.indiana.edu/research/lucas/index.html
TFIDF Calculations
  TF = number of occurrences of a term in a document
  IDF = log of the total number of documents over the number of documents that contain the desired term
$$ \mathrm{tf}_{i,d} = \frac{|d_i|}{\left|\sum_{i}^{k} d_i\right|} $$
$$ \mathrm{idf}_{i,D} = \log_2 \frac{|D|}{|\{d_i \mid d_i \in D\}|} $$
$$ \mathrm{tfidf}_{i,d} = (1 + \mathrm{tf}_{i,d})\,\mathrm{idf}_{i,D} \quad \text{if } \mathrm{tf}_{i,d} \ge 1 $$
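The TFIDF definitions above can be sketched in a few lines of Python. This is only an illustration, not the LUCAS implementation: the function name `tfidf_weights` is hypothetical, and the sketch assumes TF is the raw occurrence count (as in the bullet definition) so that the condition tf ≥ 1 simply means the term occurs in the document.

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document.

    Assumption: tf is the raw count of the term in the document, and
    idf = log2(|D| / number of documents containing the term).
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each term once per document

    weights = []
    for tokens in docs:
        counts = Counter(tokens)
        # Every term in `counts` has tf >= 1, so (1 + tf) * idf applies.
        weights.append({
            term: (1 + tf) * math.log2(n_docs / doc_freq[term])
            for term, tf in counts.items()
        })
    return weights

example = [["cell", "death", "cell"], ["cell", "protein"], ["death"], ["human"]]
for row in tfidf_weights(example):
    print(row)
```

A term that appears in every document gets idf = 0 and therefore weight 0, which is why very common words drop out of the filter list.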
Term Relationship Measurements
  Intra-sentence distance: sentence structure taken into account
  Inter-sentence distance: sentence structure ignored
Ex.
  “... and is not present in the mitochondria. Permeability is another...”
  “... mitochondrial permeability is an important aspect of apoptosis...”
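The difference between the two measurements can be sketched as follows. The function name `term_distances`, the regular-expression sentence splitter, and the sample sentence are assumptions for illustration; the slides do not say how BioKnOT tokenizes text.

```python
import re

def term_distances(text, first, second, respect_sentences):
    """Word distances from each occurrence of `first` to the next `second`.

    respect_sentences=True  -> intra-sentence: pairs must share a sentence.
    respect_sentences=False -> inter-sentence: sentence breaks are ignored.
    """
    if respect_sentences:
        units = re.split(r"(?<=[.!?])\s+", text)  # naive sentence splitter
    else:
        units = [text]

    distances = []
    for unit in units:
        words = re.findall(r"[a-z0-9]+", unit.lower())
        for i, word in enumerate(words):
            if word != first:
                continue
            for j in range(i + 1, len(words)):
                if words[j] == second:
                    distances.append(j - i)
                    break
    return distances

text = "It is not present in the mitochondria. Permeability is another factor."
print(term_distances(text, "mitochondria", "permeability", True))
print(term_distances(text, "mitochondria", "permeability", False))
```

With sentence structure respected, the pair is not reported because the two words sit on opposite sides of a period; with it ignored, they are adjacent.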
Inter-sentence vs. Intra-sentence distance
Searching for the relationship cell → death
[Diagram of example documents]
  Doc A: …cell…
  Doc B: …cell death…
  Doc C: …cell. Death…
  Doc D: …death…
  Doc E: …cell death…
[Diagram regions]
  Documents used to construct the random model
  Initial search set of documents
  Documents that are scored and returned to the user
Visual Representation of Term Relationships
  Graph M: example of a Term Relationship Graph that was specified by the user
  Graph N: example of a Term Relationship Graph that was taken from an article's abstract
Scoring an Article
  M = User Defined Term Relationships
  N = Abstract of Individual Article Term Relationships
  S = Scoring Matrix
  P = Presence or Absence of a Term Relationship from M in N
  f = Sigmoidal Term Relationship Function
$$ \text{Abstract Score} = \sum P_{M,N}(i,j) \times S_{i,j} \times f_{M_{i,j}}(N_{i,j}) $$
$$ P_{M,N}(i,j) = \begin{cases} 1 & \text{if } M_{i,j} \times N_{i,j} \neq 0 \\ -1 & \text{otherwise} \end{cases} $$
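A minimal sketch of the Abstract Score sum, under stated assumptions: relationships are stored as dictionaries keyed by (term_i, term_j) pairs rather than as matrices, the sum runs only over relationships the user defined in M, and the membership function `f` is passed in by the caller. The name `abstract_score` and the sample values are hypothetical.

```python
def abstract_score(M, N, S, f):
    """Sum of P * S * f over the user-defined relationships in M.

    M, N: {(term_i, term_j): distance}; a missing pair counts as 0.
    S:    {(term_i, term_j): score}.
    f:    callable f(m, n) giving the membership of abstract distance n
          for the user-defined distance m.
    """
    total = 0.0
    for pair, m in M.items():
        n = N.get(pair, 0.0)
        presence = 1 if m * n != 0 else -1  # P_{M,N}(i,j)
        total += presence * S.get(pair, 0.0) * f(m, n)
    return total

# Toy membership function that always returns 1, to isolate the effect of P.
always_one = lambda m, n: 1.0
M = {("cell", "death"): 2.0, ("protein", "cell"): 4.0}
N = {("cell", "death"): 3.0}
S = {("cell", "death"): 5.08, ("protein", "cell"): -3.30}
print(abstract_score(M, N, S, always_one))
```

Note that a relationship missing from the abstract flips the sign of its scoring-matrix entry, so an absent relationship with a negative entry raises the score in this literal reading of the formula.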
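Sigmoidal Scoring Function
$$ f_{M_{i,j}}(N_{i,j}) = \begin{cases} 1 & \text{if } x \le \alpha \\ 1 - \frac{1}{2} \cdot \frac{x - \alpha}{\beta - \alpha} & \text{if } \alpha < x \le \beta \\ \frac{1}{2} \cdot \frac{x - \alpha}{\beta - \alpha} & \text{if } \beta < x \le \gamma \\ 0 & \text{if } x > \gamma \end{cases} \qquad \text{with } x = N_{i,j} $$
[Plot: % Term Membership (y-axis ticks 0, ½, 1) against Term Distance (x-axis ticks α, β, γ)]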
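A minimal sketch of a membership function with the shape the plot suggests: full membership up to α, ½ at β, and 0 beyond γ. The piecewise-linear form, the name `term_membership`, and the sample parameters are assumptions for illustration; in particular, the branch for β < x ≤ γ is chosen here so that the curve falls to 0 at γ, and it is not confirmed by the slide.

```python
def term_membership(x, alpha, beta, gamma):
    """Membership of a term distance x, decreasing from 1 to 0.

    Assumed anchor points: (alpha, 1), (beta, 1/2), (gamma, 0),
    joined by straight lines.
    """
    if not alpha < beta < gamma:
        raise ValueError("expected alpha < beta < gamma")
    if x <= alpha:
        return 1.0
    if x <= beta:
        return 1.0 - 0.5 * (x - alpha) / (beta - alpha)
    if x <= gamma:
        return 0.5 * (gamma - x) / (gamma - beta)  # assumed branch
    return 0.0

for distance in (1, 3, 4, 6, 9):
    print(distance, term_membership(distance, alpha=2, beta=4, gamma=8))
```

Terms that sit close together keep full membership, and the contribution of a relationship fades as the words drift apart.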
Scoring Matrix (Random Model)
Derived from the TFIDF terms that were defined by the user and abstracts of all the articles returned by the initial term search.
User defined term relationships are found in all the abstracts and the log-odds score is taken.
$$ \text{LOD Score}(t_i, t_j) = \log_2 \frac{P(t_j \mid t_i, \Delta)}{P(t_i) \times P(t_j)} $$
$P(t_j \mid t_i, \Delta)$ is found by first finding a word, $t_i$, that the user has defined and then opening up a 5 word reading frame, $\Delta$, following $t_i$. The presence of a second user defined word, $t_j$, must be within $\Delta$.
Steps to derive the Scoring Matrix
Determine important terms
  cell, death, human
Look for relationships of those words in the search space.
  Relationships: cell→death, cell→human, death→cell, death→human, human→cell, human→death
  Search Space (abstract, 20 words): The effects … cell in a human … in cancer.
Once an important term is found, a 5 word reading frame is opened. If a relationship is found within the reading frame, then the distance between the words is taken.
  cell→human = 3
If multiple occurrences of the same relationship are found in the search space, the average is taken.
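The reading-frame steps can be sketched as below. The function name `average_relationship_distances` is hypothetical, and the sketch assumes that only the nearest occurrence of each second term inside a frame is recorded; the slides do not specify how repeated hits within one frame are handled.

```python
from collections import defaultdict

def average_relationship_distances(words, terms, frame=5):
    """Average distance for each directed relationship (t_i, t_j).

    For every important term t_i, look at the `frame` words that follow it
    and record the distance to the nearest occurrence of each other
    important term t_j. Repeated relationships are averaged.
    """
    hits = defaultdict(list)
    for i, word in enumerate(words):
        if word not in terms:
            continue
        seen = set()
        for offset in range(1, frame + 1):
            if i + offset >= len(words):
                break
            other = words[i + offset]
            if other in terms and other != word and other not in seen:
                seen.add(other)
                hits[(word, other)].append(offset)
    return {pair: sum(d) / len(d) for pair, d in hits.items()}

abstract = "the effects on a cell in a human are studied in cancer".split()
print(average_relationship_distances(abstract, {"cell", "death", "human"}))
```

For this made-up abstract the only relationship found is cell→human at distance 3, matching the value on the slide.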
Steps to derive the Scoring Matrix
Lastly, these relationships, along with the individual word probabilities, can be taken, scored and structured into a matrix.
  P(cell→human) = 2/12 = .167
  P(cell) = .03
  P(human) = .06
  LOD(cell→human) = 1.97
  Continue for all relationships

             apoptosis   human   cell
  apoptosis  0           1.27    -1.08
  human      1.64        0       0
  cell       2.35        1.97    0
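The worked LOD value can be checked numerically. In this sketch the function name `lod_score` and its `base` parameter are hypothetical. With the probabilities on the slide, a base-10 logarithm reproduces 1.97, whereas a base-2 logarithm gives roughly 6.54, so the example value appears to have been computed in base 10.

```python
import math

def lod_score(p_joint, p_i, p_j, base=2):
    """Log-odds of observing t_j in the reading frame after t_i.

    p_joint: P(t_j | t_i, frame); p_i, p_j: individual term probabilities.
    """
    return math.log(p_joint / (p_i * p_j), base)

print(round(lod_score(0.167, 0.03, 0.06, base=10), 2))
print(round(lod_score(0.167, 0.03, 0.06, base=2), 2))
```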
Adding User Weights to Term Matrix
  The user is asked to enter a weight for each word relationship that is found within the user's expansion statement.
  Weights lie in the range [0, 2].
  The weight is denoted $r_{i,j}$ for term $i$ to term $j$.
  Weights are multiplied by the matrix values to add the user's input into the random model.
Scoring Matrix Before User's Word Weights

  S_{i,j}   cell    death   protein
  cell      0.0     2.54    0.0
  death     0.98    0.0     0.0
  protein   -1.65   3.65    0.0

User's Word Weight submissions

  cell → death      2.0
  death → cell      1.0
  protein → cell    0.5
  protein → death   1.5

Scoring Matrix After User's Word Weights

  S_{i,j}   cell    death   protein
  cell      0.0     5.08    0.0
  death     0.98    0.0     0.0
  protein   -3.30   5.48    0.0

$$ \text{Final Score } S_{i,j} = \begin{cases} 0 & \text{if } S_{i,j} = 0 \\ r_{i,j} \times S_{i,j} & \text{if } S_{i,j} > 0 \\ S_{i,j} \times \frac{1}{r_{i,j}} & \text{if } S_{i,j} < 0 \end{cases} $$
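The Final Score rule can be sketched as a small function and checked against the before and after matrices. The name `apply_user_weight` is hypothetical, and the error raised for a zero weight on a negative score is an assumption, since the slides allow weights of 0 but do not say how that case is handled.

```python
def apply_user_weight(score, weight):
    """Combine a scoring-matrix entry with the user's weight r in [0, 2].

    Positive scores are multiplied by the weight; negative scores are
    divided by it, so a high weight always pushes the entry upward.
    """
    if score == 0:
        return 0.0
    if score > 0:
        return weight * score
    if weight == 0:
        # Assumption: the slides do not define this case.
        raise ValueError("weight 0 is undefined for a negative score")
    return score / weight

before = {("cell", "death"): 2.54, ("death", "cell"): 0.98,
          ("protein", "cell"): -1.65, ("protein", "death"): 3.65}
weights = {("cell", "death"): 2.0, ("death", "cell"): 1.0,
           ("protein", "cell"): 0.5, ("protein", "death"): 1.5}
for pair, score in before.items():
    print(pair, apply_user_weight(score, weights[pair]))
```

The protein → death entry comes out as 5.475, which the slide shows rounded to 5.48.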
Visual Representation of Term Relationships
  Graph M: example of a Term Relationship Graph that was specified by the user
  Graph N: example of a Term Relationship Graph that was taken from an article's abstract
Comparing Term Relationship Graphs
In order to compare the word graphs, an adjacency matrix must be created. This is where the values of $M_{i,j}$ and $N_{i,j}$ are taken.

Matrix M
             apoptosis   tumor
  apoptosis  0           5.00
  tumor      0           0

Matrix N
           fas    induce
  fas      0      3.00
  induce   0      0
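Building an adjacency matrix from a term relationship graph can be sketched as follows. The edge-list representation and the name `adjacency_matrix` are assumptions for illustration; each edge is a (from_term, to_term, distance) triple, and rows and columns are ordered alphabetically.

```python
def adjacency_matrix(edges):
    """Turn directed, weighted edges into (terms, matrix).

    matrix[i][j] holds the distance for the relationship terms[i] -> terms[j],
    or 0 when the graph has no such edge.
    """
    terms = sorted({term for a, b, _ in edges for term in (a, b)})
    index = {term: k for k, term in enumerate(terms)}
    matrix = [[0.0] * len(terms) for _ in terms]
    for a, b, distance in edges:
        matrix[index[a]][index[b]] = distance
    return terms, matrix

terms, matrix = adjacency_matrix([("apoptosis", "tumor", 5.0)])
print(terms)
print(matrix)
```

To compare a user graph with an abstract graph entry by entry, both matrices would need to be built over the same term list.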
Results and Refinement
Support Score in the form of Citation Frequency, which is simply the citation count supplied by ISI's Web of Science divided by the difference in years from now to the publication date.
Semantic Score from the equation
$$ \sum P_{M,N}(i,j) \times S_{i,j} \times f_{M_{i,j}}(N_{i,j}) $$
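The Support Score described above can be sketched as a one-line calculation. The function name `citation_frequency` and the sample numbers are hypothetical, and treating an article published in the current year as one year old is an assumption made here to avoid dividing by zero.

```python
def citation_frequency(citation_count, publication_year, current_year):
    """Citations per year since publication."""
    years = current_year - publication_year
    if years < 0:
        raise ValueError("publication year is in the future")
    # Assumption: count the current year as one full year.
    return citation_count / max(years, 1)

print(citation_frequency(40, 2000, 2004))
print(citation_frequency(5, 2004, 2004))
```

Dividing by age keeps older articles from dominating the ranking purely because they have had more time to collect citations.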
Software Demonstration
  BioKnOT: http://biokdd.informatics.indiana.edu/cgi-bin/jccostel/thesis/bioknot.cgi
  PubMed: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=PubMed
Summary
  BioKnOT offers a new and effective way to search research articles.
  BioKnOT offers many features that aid the user in deciding what factors are important in retrieving articles.
  Currently under submission to the SIGIR Bioinformatics workshop.
Future Work
  Add more sophisticated support through citation frequency
  Increase the efficiency of the scoring method
  Usability analysis
  Incorporate BioKnOT into CATPA
  Develop a Bioinformatics Knowledge Base locally using BioKnOT
Acknowledgments
  Professor Mehmet Dalkilic
  Professor Javed Mostafa
  Professor Sun Kim