


Unsupervised Discovery of Compound Entities for Relationship Extraction

Cartic Ramakrishnan, Pablo N. Mendes, Shaojun Wang, Amit P. Sheth

mendes.2@wright.edu

Kno.e.sis Center, Dept. of Computer Science & Engineering
Wright State University, Dayton, OH – USA

http://knoesis.wright.edu

October 2nd, 2008
EKAW'08, Acitrezza, Sicily, Italy


Inspiration

● Undiscovered Public Knowledge: Swanson's discoveries

● By manually inspecting abstracts, Swanson found that Fish Oil relates to Raynaud's Disease

Can we apply automatic algorithms to make such discoveries?

● Hypothesis Validation: does Magnesium relate to Migraine? How do we use knowledge available in the literature to support a hypothesis?


Present: Search and Sift

[Figure: Swanson's Discoveries. Labels: Magnesium, Migraine, PubMed, ?, Stress, Spreading Cortical Depression, Calcium Channel Blockers]

Associations were discovered through keyword searches followed by manual analysis of the text to establish possibly relevant relationships

11 possible associations found


Characteristics: Search and Sift

● User input: keywords
● Output: documents that contain the keywords
● Possible outcomes:

a) Current document contains the answer

b) Doesn't contain the answer: discard, try next result

c) Partially answers: adjust keywords and restart; follow a link

● Comments:
   results are documents, not the answers themselves
   links reflect developer intention


Semantic Browser

“EGF and hypoxia induce CXCR4 in non-small cell lung cancer...” [PMID:15802268]

“PTEN protein could inhibit cell invasion even in the presence of ... epidermal growth factor receptor (EGFR)” [PMID: 15986432]

Record triple:
● EGFR induces CXCR4


Future: Trailblazing

● User input: keywords
● Output: annotated text, triples (s, p, o)
● Common outcomes:

a) Current triple answers the question

b) Doesn't answer: discard, try next result

c) Partially answers: follow a link to a related fact

● Comments:
   results are facts (not documents)
   links reflect conceptual associations
   following a link leaves a trail


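[Figure: three layers linking Fish Oils to Raynaud's Disease.
UMLS Semantic Network: types Biologically active substance, Lipid, and Disease or Syndrome, with relationships affects, causes, and complicates.
MeSH: Fish Oils and Raynaud's Disease, each attached to a type by instance_of; the relationship between the two is unknown (???????).
PubMed: document counts of 9284, 4733, and 5.]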

Relationship Extraction

...to fill the gap from text to triples


Challenges and Opportunities

● Vocabularies, thesauri, and ontologies exist in several fields. How can we use them?
   ● lexicon: Fish Oils, Raynaud’s Disease, etc.
   ● types/labels: Fish Oils instance of Lipid
   ● relationships between types: Lipid affects Disease

● Simple identification of ontology terms in text is not enough
   Compound Entities
   Complex Relationships


Challenge: Compound Entities


• Entities (MeSH terms) in sentences occur in modified forms
   • “adenomatous” modifies “hyperplasia”
   • “An excessive endogenous or exogenous stimulation” modifies “estrogen”

• Entities can also occur as composites of 2 or more other entities
   • “adenomatous hyperplasia” and “endometrium” occur as “adenomatous hyperplasia of the endometrium”


What we are not doing

● Spotting Entity Mentions

● NER (Named Entity Recognition)

● Creating an ontology out of text


Extraction Approach

● Use a dependency parser to segment sentences into Subject, Predicate, and Object

● Subjects and Objects represent compound entities

● Use corpus statistics to predict constituents of compound entities
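
The segmentation step above can be sketched on a hand-written dependency parse. This is only an illustration: the token list, head indices, and dependency labels below are assumptions typed in by hand (no parser is run), and the helper names `subtree` and `extract_triple` are hypothetical, not taken from the authors' system.

```python
def subtree(index, heads):
    """Return the indices of a token and all its descendants, in sentence order."""
    found = {index}
    changed = True
    while changed:
        changed = False
        for i, head in enumerate(heads):
            if head in found and i not in found:
                found.add(i)
                changed = True
    return sorted(found)


def extract_triple(tokens, heads, labels):
    """Split a parsed sentence into (subject phrase, predicate, object phrase)."""
    root = labels.index("root")  # the relationship head

    def child(label):
        # First direct dependent of the root carrying the given label.
        return next(i for i, (h, l) in enumerate(zip(heads, labels))
                    if h == root and l == label)

    def phrase(i):
        # The whole subtree under a head is kept as one compound entity.
        return " ".join(tokens[j] for j in subtree(i, heads))

    return phrase(child("nsubj")), tokens[root], phrase(child("dobj"))


# Hand-written parse (assumed); heads[i] is the index of token i's head, -1 for the root.
tokens = ["estrogen", "induces", "adenomatous", "hyperplasia", "of", "the", "endometrium"]
heads = [1, -1, 3, 1, 3, 6, 4]
labels = ["nsubj", "root", "amod", "dobj", "prep", "det", "pobj"]

triple = extract_triple(tokens, heads, labels)
print(triple)
```

Keeping the full subtree under the subject and object heads is what makes the extracted arguments compound entities rather than single ontology terms.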


Extraction Algorithm

[Figure: example annotated with the relationship head, the subject head, and two object heads]


Preliminary results


Extracted Triples


Predicting constituents

● Greedy, mutual-information-based word grouping is used to predict constituents

● Mutual Information: a measure for discovering interesting word collocations
   the information that two random variables share: it measures how much knowing one of these variables reduces our uncertainty about the other
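
A minimal sketch of greedy grouping driven by mutual information follows. The slides do not give the exact scoring or stopping rule, so this sketch assumes pointwise mutual information (PMI) over adjacent word pairs and a fixed threshold; `pmi_table`, `greedy_group`, and the threshold value are hypothetical names and choices made for illustration.

```python
import math
from collections import Counter


def pmi_table(sentences):
    """Pointwise mutual information (base 2) for every adjacent word pair."""
    unigrams, bigrams = Counter(), Counter()
    for words in sentences:
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    table = {}
    for (a, b), count in bigrams.items():
        p_ab = count / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        table[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return table


def greedy_group(words, table, threshold=1.0):
    """Repeatedly merge the adjacent groups whose boundary has the highest PMI."""
    groups = [[w] for w in words]
    while len(groups) > 1:
        # Score each boundary by the PMI of the two words that meet there.
        scores = [table.get((groups[i][-1], groups[i + 1][0]), float("-inf"))
                  for i in range(len(groups) - 1)]
        best = max(range(len(scores)), key=scores.__getitem__)
        if scores[best] < threshold:
            break  # no remaining boundary is strong enough to merge
        groups[best:best + 2] = [groups[best] + groups[best + 1]]
    return [" ".join(g) for g in groups]


# Hand-picked scores (assumed) to show the grouping behaviour on one phrase.
example_scores = {
    ("adenomatous", "hyperplasia"): 3.0,
    ("hyperplasia", "of"): 0.2,
    ("of", "the"): 0.5,
    ("the", "endometrium"): 2.0,
}
phrase = ["adenomatous", "hyperplasia", "of", "the", "endometrium"]
grouped = greedy_group(phrase, example_scores)
print(grouped)
```

In practice the scores would come from `pmi_table` run over a large corpus; the hand-picked table is used here only so that the grouping is easy to follow.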


Representation – Resulting RDF

[Figure: resulting RDF graph.
Legend: Modifiers, Modified entities, Composite Entities.
Nodes: estrogen; An excessive endogenous or exogenous stimulation; modified_entity1; modified_entity2; composite_entity1; adenomatous hyperplasia; endometrium.
Edge labels: hasModifier (2), hasPart (4), induces (1).]
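
One way to picture this representation is as a list of (subject, predicate, object) tuples. The edge assignment below is an assumed reading of the figure's labels combined with the earlier "adenomatous modifies hyperplasia" example; it is not taken from the authors' data, and `leaf_terms` is a hypothetical helper.

```python
STRUCTURAL = {"hasPart", "hasModifier"}

# Assumed reconstruction of the graph; the slide only shows node and edge labels.
triples = [
    ("modified_entity1", "hasPart", "estrogen"),
    ("modified_entity1", "hasModifier", "An excessive endogenous or exogenous stimulation"),
    ("modified_entity2", "hasPart", "hyperplasia"),
    ("modified_entity2", "hasModifier", "adenomatous"),
    ("composite_entity1", "hasPart", "modified_entity2"),
    ("composite_entity1", "hasPart", "endometrium"),
    ("modified_entity1", "induces", "composite_entity1"),
]


def leaf_terms(entity, triples):
    """Collect the plain terms a compound entity is built from."""
    children = [o for s, p, o in triples if s == entity and p in STRUCTURAL]
    if not children:
        return {entity}  # a plain term has no structural children
    terms = set()
    for child in children:
        terms |= leaf_terms(child, triples)
    return terms


print(leaf_terms("composite_entity1", triples))
```

Separating the structural predicates (hasPart, hasModifier) from named relationships such as induces keeps the compound entities decomposable into ontology terms while the relationship itself links the compounds.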


Inspiration Revisited

[Figure: overview from Text to knowledge exploration and discovery.
Labels: Text; Extraction of Semantics from text; Assigning interpretation to text; Semantic metadata in the form of semi-structured data; Semantic Metadata Guided Knowledge Exploration; Semantic Metadata Guided Knowledge Discovery; Triple-based Semantic Search; Semantic browser; Subgraph discovery; Query Formulation.
Citations shown on the figure: [EKAW'08], [accepted WI'08], [SIGKDD'05], [ISWC'06], [ICSC'08].]


References

● C. Ramakrishnan, W. H. Milnor, M. Perry and A. P. Sheth. "Discovering Informative Connection Subgraphs in Multi-relational Graphs". SIGKDD Explorations 7(2): 56-63 (2005)

● C. Ramakrishnan, K. Kochut and A. P. Sheth. "A Framework for Schema-Driven Relationship Discovery from Unstructured Text". International Semantic Web Conference 2006: 583-596

● C. Ramakrishnan, P. N. Mendes, R. A. T. S. da Gama, G. C. N. Ferreira and A. P. Sheth. "Joint Extraction of Compound Entities and Relationships to support Semantic Browsing over Biomedical Literature". Web Intelligence WI'08, Australia.

● C. Ramakrishnan, P. N. Mendes, S. Wang and A. P. Sheth. "Unsupervised Discovery of Compound Entities for Relationship Extraction". 16th International Conference on Knowledge Engineering and Knowledge Management: Knowledge Patterns. EKAW'08, Sicily, Italy.


Pablo N. Mendes
mendes.2@wright.edu
http://knoesis.wright.edu

Thank you!


Challenge: Compound Entities

● Structurally complex
   Nested
   Overlapping
   Discontinuous

● “Semantically” complex
   Ontology term + Modifiers
   Compositions of ontology terms

● Large number of possible combinations of terms: low probability that an ontology would contain all of them


Paths between Migraine and Magnesium

Paths are considered interesting if they contain one or more named relationships other than hasPart or hasModifier
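
The interestingness rule can be sketched as a filter over paths in a triple graph. The toy triples, the node names `entity_a` through `entity_c`, the undirected traversal, and the `max_edges` limit are all assumptions made for illustration; the slides do not describe how paths are enumerated.

```python
from collections import defaultdict

STRUCTURAL = {"hasPart", "hasModifier"}


def simple_paths(triples, start, goal, max_edges=4):
    """Yield (nodes, predicates) for each simple path, ignoring edge direction."""
    adjacency = defaultdict(list)
    for s, p, o in triples:
        adjacency[s].append((p, o))
        adjacency[o].append((p, s))
    stack = [(start, [start], [])]
    while stack:
        node, nodes, predicates = stack.pop()
        if node == goal:
            yield nodes, predicates
            continue
        if len(predicates) == max_edges:
            continue  # path budget exhausted
        for predicate, neighbour in adjacency[node]:
            if neighbour not in nodes:  # keep the path simple (no repeated nodes)
                stack.append((neighbour, nodes + [neighbour], predicates + [predicate]))


def is_interesting(predicates):
    """A path qualifies if at least one edge is a named (non-structural) relationship."""
    return any(p not in STRUCTURAL for p in predicates)


# Toy graph (assumed): one path uses a named relationship, the other is purely structural.
triples = [
    ("entity_a", "hasPart", "migraine"),
    ("entity_a", "caused_by", "entity_b"),
    ("entity_b", "hasPart", "magnesium"),
    ("entity_c", "hasPart", "migraine"),
    ("entity_c", "hasPart", "magnesium"),
]

all_paths = list(simple_paths(triples, "migraine", "magnesium"))
interesting = [(n, p) for n, p in all_paths if is_interesting(p)]
print(interesting)
```

The purely structural path through `entity_c` only says that both terms occur inside the same compound entity, which is why it is filtered out.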


An example of such a path

[Figure: a path between migraine and magnesium.
Nodes: migraine(D008881); magnesium(D008274); platelet(D001792); collagen(D003094); me_3142by_a_primary_abnormality_of_platelet_behavior; me_2286_13%_and_17%_adp_and_collagen_induced_platelet_aggregation.
Edge labels: caused_by (1), stimulated (2), hasPart (3).]


Evaluation conducted using this tool

Evaluation on OMIM RDF

1938 triple-sentence comparisons

2 evaluators (not domain experts)

System currently in use by domain experts at

● CCHMC (2 experts)
● Awaiting results


Results of Manual Evaluation