Unsupervised Discovery of Compound Entities for Relationship Extraction
Cartic Ramakrishnan, Pablo N. Mendes, Shaojun Wang, Amit P. Sheth
Kno.e.sis Center, Dept. of Computer Science & Engineering
Wright State University, Dayton, OH – USA
http://knoesis.wright.edu
October 2nd, 2008
EKAW'08, Acitrezza, Sicily, Italy
Inspiration
● Undiscovered Public Knowledge: Swanson's discoveries
  By manually inspecting abstracts, Swanson found that Fish Oil relates to Raynaud's Disease
  Can we apply automatic algorithms to make such discoveries?
● Hypothesis Validation
  Does Magnesium relate to Migraine?
  How do we use knowledge available in the literature to support a hypothesis?
Present: Search and Sift
[Figure: PubMed searches connecting Magnesium and Migraine through candidate intermediate concepts (?): Stress, Spreading Cortical Depression, Calcium Channel Blockers]
Swanson’s Discoveries
Associations were discovered through keyword searches followed by manual analysis of text to establish possibly relevant relationships
11 possible associations found
Characteristics: Search and Sift
● User Input: keywords
● Output: documents that contain keywords
● Possible outcomes:
a) Current document contains the answer
b) Doesn't contain: discard, try next result
c) Partially answers: adjust keywords and restart; follow a link
● Comments:
  results are documents, not the answers themselves
  links reflect developer intention
Semantic Browser
“EGF and hypoxia induce CXCR4 in non-small cell lung cancer...” [PMID:15802268]
“PTEN protein could inhibit cell invasion even in the presence of ... epidermal growth factor receptor (EGFR)” [PMID: 15986432]
Record triple: ● EGFR induces CXCR4
Future: Trailblazing
● User input: keywords
● Output: annotated text, triples (s, p, o)
● Common outcomes:
a) Current triple answers the question
b) Doesn't answer: discard, try next result
c) Partially answers: follow a link to a related fact
● Comments:
  results are facts (not documents)
  links reflect conceptual associations
  following a link leaves a trail
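[Figure: three layers linking Fish Oils to Raynaud’s Disease
UMLS Semantic Network: Biologically active substance, Lipid, Disease or Syndrome, connected by affects, causes, complicates
MeSH: Fish Oils and Raynaud’s Disease, each attached by instance_of; the relationship between them is unknown (???????)
PubMed: 9284 documents, 4733 documents, 5 documents]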
Relationship Extraction
...to fill the gap from text to triples
Challenges and Opportunities
● Vocabularies, thesauri, and ontologies exist in several fields. How can we use them?
  lexicon: Fish Oils, Raynaud’s Disease, etc.
  types/labels: Fish Oils instance of Lipid
  relationships between types: Lipid affects Disease
● Simple identification of ontology terms in text is not enough
  Compound Entities
  Complex Relationships
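As a rough illustration of how a lexicon, type labels, and type-level relationships could be combined, the sketch below looks up two entity mentions and returns the relationships the schema allows between their types. The dictionaries are tiny hand-made stand-ins built from the examples on this slide, and the function name `candidate_relations` is hypothetical. This is not the authors' implementation.

```python
# Hypothetical miniature versions of the three resources named on the slide.
LEXICON = {
    "fish oils": "Fish Oils",
    "raynaud's disease": "Raynaud's Disease",
}
INSTANCE_OF = {
    "Fish Oils": "Lipid",
    "Raynaud's Disease": "Disease",
}
# Type-level relationships; only "Lipid affects Disease" comes from the slide.
TYPE_RELATIONS = {("Lipid", "Disease"): {"affects"}}


def candidate_relations(subject_text, object_text):
    """Return relationship names the schema permits between two mentions.

    An empty set means at least one mention is not an exact lexicon term,
    or the schema has no relationship between the two types.
    """
    subj = LEXICON.get(subject_text.lower())
    obj = LEXICON.get(object_text.lower())
    if subj is None or obj is None:
        return set()
    return set(TYPE_RELATIONS.get((INSTANCE_OF[subj], INSTANCE_OF[obj]), set()))


print(candidate_relations("Fish Oils", "Raynaud's Disease"))
# A compound mention is not an exact lexicon entry, so plain lookup fails.
print(candidate_relations("Fish Oils", "adenomatous hyperplasia of the endometrium"))
```

The second call returns nothing because exact term lookup cannot handle modified or composite mentions, which is the gap the compound-entity work addresses.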
Challenge: Compound Entities
• Entities (MeSH terms) in sentences occur in modified forms
  “adenomatous” modifies “hyperplasia”
  “An excessive endogenous or exogenous stimulation” modifies “estrogen”
• Entities can also occur as composites of 2 or more other entities
  “adenomatous hyperplasia” and “endometrium” occur as “adenomatous hyperplasia of the endometrium”
What we are not doing
● Spotting Entity Mentions
● NER (Named Entity Recognition)
● Creating an ontology out of text
Extraction Approach
● Use a dependency parser to segment sentences into Subject, Predicate, and Object
● Subjects and Objects represent compound entities
● Use corpus statistics to predict constituents of compound entities
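The segmentation step can be sketched on a hand-written dependency parse. The example sentence is assembled from phrases used elsewhere in these slides, and the dependency labels (`nsubj`, `dobj`, and so on) and the helper names are assumptions made for illustration. A real pipeline would obtain the parse from a dependency parser rather than from a literal table.

```python
from collections import defaultdict

# Hand-written parse of "estrogen induces adenomatous hyperplasia of the
# endometrium": (token index, word, head index, dependency label).
TOKENS = [
    (0, "estrogen", 1, "nsubj"),
    (1, "induces", -1, "root"),
    (2, "adenomatous", 3, "amod"),
    (3, "hyperplasia", 1, "dobj"),
    (4, "of", 3, "prep"),
    (5, "the", 6, "det"),
    (6, "endometrium", 4, "pobj"),
]


def subtree(head, children):
    """Return the sorted token indices of the subtree rooted at `head`."""
    nodes = [head]
    for child in children[head]:
        nodes.extend(subtree(child, children))
    return sorted(nodes)


def segment(tokens):
    """Split a parsed sentence into (subject, predicate, object) strings."""
    children = defaultdict(list)
    for idx, _, head, _ in tokens:
        if head >= 0:
            children[head].append(idx)
    words = {idx: word for idx, word, _, _ in tokens}
    label = {idx: rel for idx, _, _, rel in tokens}
    root = next(idx for idx, _, head, _ in tokens if head < 0)
    subj_head = next(c for c in children[root] if label[c] == "nsubj")
    obj_head = next(c for c in children[root] if label[c] == "dobj")

    def phrase(head):
        return " ".join(words[i] for i in subtree(head, children))

    return phrase(subj_head), words[root], phrase(obj_head)


print(segment(TOKENS))
```

Taking the whole subtree under the subject and object heads is what makes them compound entities: the object here is the full phrase, not just the head word "hyperplasia".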
Extraction Algorithm
[Figure: parsed example sentence with the relationship head, the subject head, and two object heads marked]
Preliminary results
Extracted Triples
Predicting constituents
● Greedy mutual information based word grouping used to predict constituents
● Mutual Information: a measure for discovering interesting word collocations
  the information that two random variables share: it measures how much knowing one of these variables reduces our uncertainty about the other
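A minimal sketch of the idea, assuming pointwise mutual information over adjacent words, PMI(x, y) = log2(p(x, y) / (p(x) p(y))), as the collocation score and a simple left-to-right greedy grouping. The slides do not specify the exact estimator or grouping procedure, so the toy corpus, the threshold, and the function names below are all illustrative assumptions.

```python
import math
from collections import Counter

# Toy corpus; a real system would estimate counts from a large collection.
CORPUS = [
    "adenomatous hyperplasia of the endometrium",
    "adenomatous hyperplasia was observed",
    "hyperplasia of the endometrium",
    "stimulation of the ovary",
    "adenomatous hyperplasia of the colon",
]


def pmi_table(sentences):
    """Build a PMI scoring function for adjacent word pairs."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        words = sentence.split()
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    def pmi(x, y):
        if bigrams[(x, y)] == 0:
            return float("-inf")  # pair never seen adjacent
        p_xy = bigrams[(x, y)] / n_bi
        return math.log2(p_xy / ((unigrams[x] / n_uni) * (unigrams[y] / n_uni)))

    return pmi


def greedy_group(words, pmi, threshold):
    """Scan left to right; attach a word to the current group when its PMI
    with the previous word reaches the threshold, else start a new group."""
    groups = [[words[0]]]
    for prev, word in zip(words, words[1:]):
        if pmi(prev, word) >= threshold:
            groups[-1].append(word)
        else:
            groups.append([word])
    return [" ".join(group) for group in groups]


pmi = pmi_table(CORPUS)
phrase = "adenomatous hyperplasia of the endometrium".split()
print(greedy_group(phrase, pmi, threshold=2.5))
# → ['adenomatous hyperplasia', 'of the endometrium']
```

On this toy corpus the weakest link in the phrase is between "hyperplasia" and "of", so the phrase splits there. The threshold of 2.5 is arbitrary and was chosen only to make that split visible.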
Representation – Resulting RDF
[Figure: RDF graph with three kinds of nodes: modifiers, modified entities, and composite entities
Nodes: estrogen; An excessive endogenous or exogenous stimulation; modified_entity1; modified_entity2; composite_entity1; adenomatous hyperplasia; endometrium
Edge labels: hasModifier, hasPart, induces]
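The representation can be sketched as plain (subject, predicate, object) tuples. The flattened figure does not preserve which edge joins which pair of nodes, so the wiring below is one plausible reading based on the compound-entity examples earlier in the talk. It should be treated as an assumption, as should the helper names.

```python
# Assumed wiring of the example graph; node and edge names follow the slide.
TRIPLES = [
    ("modified_entity1", "hasModifier",
     "An excessive endogenous or exogenous stimulation"),
    ("modified_entity1", "hasPart", "estrogen"),
    ("modified_entity2", "hasModifier", "adenomatous"),
    ("modified_entity2", "hasPart", "hyperplasia"),
    ("composite_entity1", "hasPart", "modified_entity2"),
    ("composite_entity1", "hasPart", "endometrium"),
    ("modified_entity1", "induces", "composite_entity1"),
]

STRUCTURAL = {"hasPart", "hasModifier"}


def leaf_parts(entity, triples):
    """Follow hasPart edges down to the simple entities a compound contains."""
    parts = [o for s, p, o in triples if s == entity and p == "hasPart"]
    if not parts:
        return [entity]
    leaves = []
    for part in parts:
        leaves.extend(leaf_parts(part, triples))
    return leaves


def named_relationships(triples):
    """Return triples whose predicate is a domain relationship,
    not a structural link inside a compound entity."""
    return [t for t in triples if t[1] not in STRUCTURAL]


print(leaf_parts("composite_entity1", TRIPLES))
print(named_relationships(TRIPLES))
```

Keeping structural edges (hasPart, hasModifier) separate from named relationships such as induces lets a browser move from a simple term to every compound entity that contains it.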
Inspiration Revisited
[Figure: from Text, through Extraction of Semantics from text (assigning interpretation to text), to semantic metadata in the form of semi-structured data, which supports Semantic Metadata Guided Knowledge Exploration and Semantic Metadata Guided Knowledge Discovery
Components: Triple-based Semantic Search, Semantic browser, Subgraph discovery, Query Formulation
Citations shown in the figure: [EKAW'08], [accepted WI'08], [SIGKDD'05], [ISWC'06], [ICSC'08]]
References
● C. Ramakrishnan, W. H. Milnor, M. Perry and A. P. Sheth. "Discovering Informative Connection Subgraphs in Multi-relational Graphs". SIGKDD Explorations 7(2): 56-63 (2005).
● C. Ramakrishnan, K. Kochut and A. P. Sheth. "A Framework for Schema-Driven Relationship Discovery from Unstructured Text". International Semantic Web Conference 2006: 583-596.
● C. Ramakrishnan, P. N. Mendes, R. A. T. S. da Gama, G. C. N. Ferreira and A. P. Sheth. "Joint Extraction of Compound Entities and Relationships to support Semantic Browsing over Biomedical Literature". Web Intelligence WI'08, Australia.
● C. Ramakrishnan, P. N. Mendes, S. Wang and A. P. Sheth. "Unsupervised Discovery of Compound Entities for Relationship Extraction". 16th International Conference on Knowledge Engineering and Knowledge Management Knowledge Patterns, EKAW'08, Sicily, Italy.
Pablo N. Mendes
mendes.2 @ wright.edu
http://knoesis.wright.edu
Thank you!
Challenge: Compound Entities
● Structurally complex
  Nested
  Overlapping
  Discontinuous
● “Semantically” complex
  Ontology term + Modifiers
  Compositions of ontology terms
● Large number of possible combinations of terms: low probability that an ontology would contain all of them
Paths between Migraine and Magnesium
Paths are considered interesting if they contain one or more named relationships other than hasPart or hasModifier
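The interestingness criterion can be sketched as a filter over the predicate labels along a path. The small graph below borrows node and edge names from the example path in this talk, but the shortened identifiers (`me_3142`, `me_2286`) and the choice of which nodes each edge connects are assumptions made only so the code has something to run on.

```python
from collections import deque

STRUCTURAL = {"hasPart", "hasModifier"}

# Assumed wiring for illustration only.
EDGES = [
    ("migraine", "caused_by", "me_3142"),
    ("me_3142", "hasPart", "platelet"),
    ("me_2286", "hasPart", "platelet"),
    ("me_2286", "hasPart", "collagen"),
    ("magnesium", "stimulated", "me_2286"),
]


def simple_paths(edges, start, goal, max_len=4):
    """Yield the predicate sequence of each simple path from start to goal,
    treating edges as undirected and allowing at most max_len edges."""
    adjacency = {}
    for s, p, o in edges:
        adjacency.setdefault(s, []).append((p, o))
        adjacency.setdefault(o, []).append((p, s))
    queue = deque([(start, [start], [])])
    while queue:
        node, visited, predicates = queue.popleft()
        if node == goal:
            yield predicates
            continue
        if len(predicates) == max_len:
            continue
        for predicate, neighbor in adjacency.get(node, []):
            if neighbor not in visited:
                queue.append((neighbor, visited + [neighbor], predicates + [predicate]))


def is_interesting(predicates):
    """A path qualifies if any edge is a named relationship rather than
    a structural hasPart or hasModifier link."""
    return any(p not in STRUCTURAL for p in predicates)


for path in simple_paths(EDGES, "migraine", "magnesium"):
    print(path, is_interesting(path))
```

A path made only of hasPart and hasModifier edges says that two terms share a compound entity, not that the literature asserts a relationship between them, which is why such paths are filtered out.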
An example of such a path
[Figure: path connecting migraine (D008881) and magnesium (D008274) through platelet (D001792), collagen (D003094), and the compound entities me_3142by_a_primary_abnormality_of_platelet_behavior and me_2286_13%_and_17%_adp_and_collagen_induced_platelet_aggregation
Edge labels: caused_by, hasPart, stimulated]
Evaluation conducted using this tool
Evaluation on OMIM RDF
1938 triple-sentence comparisons
2 evaluators (not domain experts)
System currently in use by domain experts at
● CCHMC (2 experts)
● Awaiting results
Results of Manual Evaluation