

Page 1: Using Ontologies to Enable Access to Multiple Heterogeneous Databases CARDGIS

Using Ontologies to Enable Access to Multiple Heterogeneous Databases

CARDGIS

Eduard Hovy

Information Sciences Institute
University of Southern California

(in collaboration with Columbia University)

Page 2: Using Ontologies to Enable Access to Multiple Heterogeneous Databases CARDGIS

Context: CARDGIS Project

Sources:
– Energy Information Administration (quarterly CD ROM).
– Bureau of Labor Statistics (http://stats.bls.gov).
– Census Bureau (CD ROM for 1992 data).
– California Energy Commission (weekly data at http://energy.ca.gov).

Enable access to multiple, heterogeneous Federal agency data sources through a single interface using standardized nomenclature, while accounting for semantic variability.

Page 3: Using Ontologies to Enable Access to Multiple Heterogeneous Databases CARDGIS

System Architecture

Sources: R, S, T

Integrated Ontology
- global terminology
- source descriptions
- integration axioms

User Interface
- ontology browser
- query constructor
User phase:
• Compose query

Ontology Construction
- DB analysis
- text analysis
Construction phase:
• Deploy DBs
• Extend ontology

Query Processor
- reformulation
- cost optimization
Access phase:
• Create DB query
• Retrieve data
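A rough sketch of what the reformulation step of the access phase might look like: a query phrased with a global ontology term is rewritten into one query per source, using the source descriptions. Every table name, column name, and the `reformulate` helper is hypothetical, and cost optimization is left out.

```python
# Hypothetical source descriptions: global ontology term -> (table, column) per source.
# None of these table or column names come from the actual agency databases.
SOURCE_DESCRIPTIONS = {
    "BLS": {"income": ("wages", "avg_salary")},
    "Census": {"income": ("households", "median_income")},
    "CEC": {"gasoline_price": ("weekly_prices", "retail_price")},
}


def reformulate(term):
    """Return one SQL string per source whose description covers the global term."""
    queries = {}
    for source, mapping in SOURCE_DESCRIPTIONS.items():
        if term in mapping:
            table, column = mapping[term]
            queries[source] = f"SELECT {column} FROM {table}"
    return queries


if __name__ == "__main__":
    for source, sql in reformulate("income").items():
        print(source, "->", sql)
```

The user only ever sees the global term; the per-source vocabulary stays inside the source descriptions.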

Page 4: Using Ontologies to Enable Access to Multiple Heterogeneous Databases CARDGIS

So What is an Ontology?

Desiderata:
– ‘anchor points’ for terminology variants (salary, income…),
– wide coverage,
– some degree of taxonomic organization for inference/program behavior control.

Terminological (not domain) ontology.
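A minimal sketch of the two desiderata that lend themselves to code: anchor points that collapse terminology variants onto one concept, and a taxonomy that can be walked for inference. The concept names and the superclass links are invented for illustration; only the variants "salary" and "income" come from the slide.

```python
# Hypothetical anchor points: each terminology variant maps to one ontology concept.
ANCHORS = {"salary": "INCOME", "income": "INCOME"}

# Hypothetical taxonomy fragment; a concept may have more than one superclass.
SUPERCLASSES = {
    "INCOME": ["MONEY", "ECONOMIC-INDICATOR"],
    "MONEY": ["ASSET"],
    "ECONOMIC-INDICATOR": ["MEASURE"],
    "ASSET": [],
    "MEASURE": [],
}


def anchor(term):
    """Map a terminology variant to its concept, or None if it is not anchored."""
    return ANCHORS.get(term.lower())


def ancestors(concept):
    """Collect every superconcept reachable through any superclass link."""
    seen = set()
    stack = list(SUPERCLASSES.get(concept, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(SUPERCLASSES.get(parent, []))
    return seen


if __name__ == "__main__":
    print(anchor("Salary"), sorted(ancestors("INCOME")))
```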

Page 5: Using Ontologies to Enable Access to Multiple Heterogeneous Databases CARDGIS

ISI’s SENSUS Ontology

Taxonomy, multiple superclass links.
Approx. 90,000 items.
Top level: Penman Upper Model (ISI).
Body: WordNet (Princeton), rearranged.
Used at ISI for machine translation, text summarization, database access.

http://vigor.isi.edu:8002/sensus2/

Page 6: Using Ontologies to Enable Access to Multiple Heterogeneous Databases CARDGIS

3 Ways of Building Ontologies

1. Combine existing knowledge resources: ontology alignment.

2. Learn from texts and Web: extract word families for thousands of concepts.

3. Parse dictionary definitions: extract information and place into ontology.

Page 7: Using Ontologies to Enable Access to Multiple Heterogeneous Databases CARDGIS

1. Cross-Ontology Alignment

1. Text Matches
– concept names (cognates; reward for delimiter confluence...)
– textual definitions (string matching, demorphing, stop words...) [Knight & Luk 94, Dalianis & Hovy 98]

2. Hierarchy Matches
– shared superconcepts, to filter ambiguity [Knight & Luk 94]
– semantic distance [Agirre et al. 94]

3. Data Item and Form Matches
– inter-concept relations [Ageno et al. 94; Rigau & Agirre 95]
– slot-filler restrictions [Okumura & Hovy 94]

Why create a new Ontology? Merge and re-use existing ones!

Problem: automatically find corresponding concepts.
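To make the first two families of heuristics concrete, here is a small sketch of a name match, a definition match, and a shared-superconcept match. These are simple stand-ins (string similarity, word overlap, a lookup of already aligned parents), not the algorithms of the cited papers; the stop-word list and all function names are assumptions.

```python
from difflib import SequenceMatcher

# A tiny illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"a", "an", "the", "of", "or", "and", "to", "for", "in"}


def _normalize(name):
    """Lower-case a concept name and treat '-' and '_' delimiters alike."""
    return name.lower().replace("-", " ").replace("_", " ")


def name_score(name_a, name_b):
    """String similarity of two concept names after delimiter normalization."""
    return SequenceMatcher(None, _normalize(name_a), _normalize(name_b)).ratio()


def definition_score(def_a, def_b):
    """Jaccard overlap of the non-stop words in two textual definitions."""
    words_a = {w for w in def_a.lower().split() if w not in STOP_WORDS}
    words_b = {w for w in def_b.lower().split() if w not in STOP_WORDS}
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


def taxonomy_score(supers_a, supers_b, aligned):
    """Fraction of A's superconcepts whose aligned partner is a superconcept of B."""
    if not supers_a:
        return 0.0
    hits = sum(1 for s in supers_a if aligned.get(s) in supers_b)
    return hits / len(supers_a)


if __name__ == "__main__":
    print(name_score("air_force", "Air-Force"))
    print(definition_score("a payment for work", "payment received for work"))
    print(taxonomy_score({"MONEY"}, {"CURRENCY"}, {"MONEY": "CURRENCY"}))
```

A combined score with a cutoff would then decide which candidate pairs are proposed as links.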

Page 8: Using Ontologies to Enable Access to Multiple Heterogeneous Databases CARDGIS

Cross-Ontology Alignment Results

Ontologies:
– SENSUS Upper Model (350)
– CYC top region (2400) [Lenat; Lehmann 96]
– MIKROKOSMOS (4790 concepts) [Mahesh 96]
– SENSUS top region (6768)

Recall (how many links were missed?): difficult to count! … 32.4 mill pairs

Precision (how many suggested links are correct?):
– 0.252 (strict)
– 0.517 (lenient)

After 5 runs: 883 suggestions (= 13% of SENSUS candidates)
– correct: 244 (= 3.6%)
– near miss: 256 (= 3.8%)
– wrong: 383 (= 5.6%)

cutoff          1.4            10     7.8    12     15
new heuristic   NAME/DEF/TAX   TAX    TAX    TAX    TAX
total           187            151    170    218    241
correct         73             11     18     36     106
near            51             92     51     60     2
wrong           63             48     101    122    39

1996

1997
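The slide does not say how its counts were derived. The arithmetic below is one reading that matches them, assuming the 32.4 million pairs are MIKROKOSMOS concepts paired with SENSUS top-region concepts and the percentages are taken over the 6768 SENSUS top-region concepts.

```python
# Concept counts as given on the slide.
mikrokosmos_concepts = 4790
sensus_top_concepts = 6768

# Assumed reading: every MIKROKOSMOS concept is a candidate partner for
# every SENSUS top-region concept.
candidate_pairs = mikrokosmos_concepts * sensus_top_concepts

# Outcome of the 5 runs, as given on the slide.
suggestions = {"correct": 244, "near miss": 256, "wrong": 383}
total_suggestions = sum(suggestions.values())

# Assumed reading: percentages are shares of the SENSUS top-region concepts.
share_of_sensus = {k: v / sensus_top_concepts for k, v in suggestions.items()}

if __name__ == "__main__":
    print(f"{candidate_pairs:,} candidate pairs")
    print(f"{total_suggestions} suggestions = {total_suggestions / sensus_top_concepts:.1%}")
    for label, share in share_of_sensus.items():
        print(f"{label}: {share:.2%}")
```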

Page 9: Using Ontologies to Enable Access to Multiple Heterogeneous Databases CARDGIS

2. The Websucker

• Corpus
– Training set WSJ 1987: 16,137 texts (32 topics).
– Test set WSJ 1988: 12,906 texts (31 topics).
– Texts indexed into categories by humans.

• Signature data
– 300 terms each, using tf.idf.
– Word forms: single words, demorphed words, multi-word phrases.

RANK  ARO        BNK          ENV            TEL
1     contract   bank         epa            at&t
2     air_force  thrift       waste          network
3     aircraft   banking      environmental  fcc
4     navy       loan         water          cbs
5     army       mr.          ozone          cable
6     space      deposit      state          bell
7     missile    board        incinerator    long-distance
8     equipment  fslic        agency         telephone
9     mcdonnell  fed          clean          telecomm.
10    northrop   institution  landfill       mci
11    nasa       federal      hazardous      mr.
12    pentagon   fdic         acid_rain      doctrine
13    defense    volcker      standard       service
14    receive    henkel       federal        news

• How many terms in signatures?
– 5, 10, 15, …, 300 terms.
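A minimal sketch of how a topic signature could be built with tf.idf. The slide does not specify the tf.idf variant; pooling all texts of a topic into one "document" and using a plain log idf are assumptions, as are the function name and the toy data.

```python
import math
from collections import Counter


def topic_signatures(docs_by_topic, size=300):
    """Rank terms per topic by tf.idf, treating each topic as one pooled document.

    docs_by_topic maps a topic label to a list of tokenized texts.
    Returns {topic: [(term, weight), ...]} with at most `size` terms each.
    """
    # Term frequency: occurrences of each term across all texts of the topic.
    tf = {
        topic: Counter(tok for doc in docs for tok in doc)
        for topic, docs in docs_by_topic.items()
    }
    n_topics = len(tf)
    # Document frequency: number of topics in which the term occurs at all.
    df = Counter(term for counts in tf.values() for term in counts)

    signatures = {}
    for topic, counts in tf.items():
        weighted = {t: c * math.log(n_topics / df[t]) for t, c in counts.items()}
        # Highest weight first; ties broken alphabetically for stable output.
        ranked = sorted(weighted.items(), key=lambda kv: (-kv[1], kv[0]))
        signatures[topic] = ranked[:size]
    return signatures


if __name__ == "__main__":
    toy = {
        "BNK": [["bank", "loan", "the"], ["bank", "thrift", "the"]],
        "TEL": [["network", "cable", "the"], ["network", "the"]],
    }
    for topic, signature in topic_signatures(toy, size=2).items():
        print(topic, signature)
```

Varying `size` from 5 to 300 corresponds to the question of how many terms a signature needs.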

Page 10: Using Ontologies to Enable Access to Multiple Heterogeneous Databases CARDGIS

Pollution on the Web

Cleanup: try various methods: tf.idf, χ², Latent Semantic Analysis...

<AIRCRAFT, w=207.998>
<ENGINE, w=178.677>
<WING, w=138.36>
<PROPELLER, w=122.317>
<FLY, w=103.187>
<AIRPLANE, w=98.0431>
<AVIATION, w=96.5663>
<FLIGHT, w=85.3079>
<AIR, w=80.1996>
<WARBIRDS, w=72.4247>
<PILOT, w=71.4707>
<MPH, w=65.987>
<CONTROL, w=65.9729>
<FUEL, w=62.3078>

<MORTICE, w=33.7982>
<WOODWORKING, w=20.9227>
<TENNON, w=20.9227>
<JOINERY, w=17.7038>
<WOOD, w=15.8356>
<HARDWOOD, w=14.4849>
<JASON, w=14.4849>
<DOTH, w=12.8755>
<BRASH, w=12.8755>
<OAK, w=12.8281>
<WEDGE, w=11.9118>
<FURNITURE, w=10.0792>
<TOOL, w=9.19486>
<SHAFT, w=8.17321>

<STAR, w=75.1358>
<ORION, w=55.8937>
<PYRAMID, w=42.1494>
<DNA, w=41.2331>
<SOUL, w=31.1539>
<IMPLOSION, w=23.8236>
<KHUFU, w=19.3133>
<GOLD, w=18.3897>
<RECURSION, w=18.3258>
<BELLATRIX, w=17.7038>
<OSIRIS, w=17.7038>
<PHI, w=16.4932>
<EMBED, w=16.4932>
<MAGNETIC, w=16.4932>
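As an illustration of one cleanup statistic, the sketch below scores a term with the standard 2×2 chi-square test of association between "term occurs" and "text belongs to the concept". It assumes chi-square is among the methods tried; the function name and the counts in the usage example are hypothetical.

```python
def chi_square(term_in, term_out, other_in, other_out):
    """Chi-square statistic for a 2x2 term/concept contingency table.

    term_in   : texts about the concept that contain the term
    term_out  : texts not about the concept that contain the term
    other_in  : texts about the concept that lack the term
    other_out : texts not about the concept that lack the term
    """
    a, b, c, d = term_in, term_out, other_in, other_out
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0  # a row or column is empty, so no association can be measured
    return n * (a * d - b * c) ** 2 / denominator


if __name__ == "__main__":
    # A term spread evenly inside and outside the concept carries no signal.
    print(chi_square(10, 10, 10, 10))
    # A term found only in the concept's texts scores high.
    print(chi_square(20, 0, 0, 20))
```

Terms that score near zero are candidates for removal from a signature, since they appear just as readily in unrelated pages.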

Page 11: Using Ontologies to Enable Access to Multiple Heterogeneous Databases CARDGIS

3. Dictionary Extraction

<hw>Babel</hw> <pos>n</pos> <sn>2</sn>
[ SENT [ NP OR [ NP A/DT place/NN ] [ NP scene/NN ] ] [ PP of/IN [ NP AND [ NP noise/NN ] [ NP confusion/NN ] ] ] ] ;/:
[ SENT [ NP a/DT confused/JJ mixture/NN ] [ PP of/IN [ NP sounds/NNS ] ] ,/, as/IN [ PP of/IN [ NP languages/NNS ] ] ] ./.

Step 1: find unencumbered dictionary (Webster 1913).
Step 2: reformat and then parse entries (http://www.isi.edu/natural-language/dpp/).
Step 3: identify individual propositions and their heads.
Step 4: convert preps to semantic relations (EM alg).
Step 5: place entries into ontology (not yet done).
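A small sketch of reading a parsed entry in the format shown for "Babel": it pulls out the tagged fields and the word/TAG tokens, then lists the nouns as candidate heads. The regular expressions and the `read_entry` helper are assumptions for illustration, not the project's actual tooling, and taking every noun as a head is a simplification of Step 3.

```python
import re

# First sense clause of the "Babel" entry, copied from the slide.
ENTRY = (
    "<hw>Babel</hw> <pos>n</pos> <sn>2</sn> "
    "[ SENT [ NP OR [ NP A/DT place/NN ] [ NP scene/NN ] ] "
    "[ PP of/IN [ NP AND [ NP noise/NN ] [ NP confusion/NN ] ] ] ]"
)

FIELD = re.compile(r"<(\w+)>(.*?)</\1>")   # e.g. <hw>Babel</hw>
TOKEN = re.compile(r"(\S+)/([A-Z]+)\b")    # e.g. place/NN


def read_entry(entry):
    """Return (fields, tokens): tagged fields as a dict, tokens as (word, tag) pairs."""
    fields = dict(FIELD.findall(entry))
    body = FIELD.sub("", entry)  # drop the tagged fields, keep the bracketed parse
    tokens = TOKEN.findall(body)
    return fields, tokens


if __name__ == "__main__":
    fields, tokens = read_entry(ENTRY)
    nouns = [word for word, tag in tokens if tag.startswith("NN")]
    print(fields)
    print(nouns)
```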

Page 12: Using Ontologies to Enable Access to Multiple Heterogeneous Databases CARDGIS

Disambiguating Extracted Info.

Identify propositions and their parts:
Impression: “A communicating [of a mold or trait] [by an external force or influence]”
Reflection: “The return [of light or sound waves] [by or as if by a mirror]”

by = AGENT or PATH?
communication by force; return by mirror; return by road

of = OWNER or NUMBER-PART or SOURCE or …?
the car of Joe; 1 of 15 people smoke; man of La Mancha

Apply EM algorithm to disambiguate.
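A toy sketch of how EM can separate the readings of "by". It is not the project's model: it assumes a simple mixture in which a preposition picks a relation and the relation picks the object noun, and it adds a hypothetical unambiguous preposition "via" (PATH only) so that the symmetry between AGENT and PATH can be broken.

```python
from collections import defaultdict

# Candidate relations per preposition. "via" is an invented unambiguous cue.
CANDIDATES = {"by": ["AGENT", "PATH"], "via": ["PATH"]}

# Toy (preposition, object) observations; the "by" objects echo the slide.
DATA = [("by", "force"), ("by", "mirror"), ("by", "road"), ("via", "road")]


def run_em(data, candidates, iterations=20):
    """Estimate P(relation | prep, object) for each observation with EM."""
    objects = sorted({obj for _, obj in data})
    relations = sorted({r for rels in candidates.values() for r in rels})
    # Uniform starting point for P(relation | prep) and P(object | relation).
    p_rel = {p: {r: 1.0 / len(rels) for r in rels} for p, rels in candidates.items()}
    p_obj = {r: {o: 1.0 / len(objects) for o in objects} for r in relations}

    posteriors = []
    for _ in range(iterations):
        # E-step: posterior over relations for every observation.
        posteriors = []
        for prep, obj in data:
            scores = {r: p_rel[prep][r] * p_obj[r][obj] for r in candidates[prep]}
            z = sum(scores.values())
            posteriors.append({r: s / z for r, s in scores.items()})
        # M-step: re-estimate both distributions from the expected counts.
        rel_counts = defaultdict(lambda: defaultdict(float))
        obj_counts = defaultdict(lambda: defaultdict(float))
        for (prep, obj), post in zip(data, posteriors):
            for r, weight in post.items():
                rel_counts[prep][r] += weight
                obj_counts[r][obj] += weight
        for prep, rels in candidates.items():
            total = sum(rel_counts[prep].values())
            p_rel[prep] = {r: rel_counts[prep][r] / total for r in rels}
        for r in relations:
            total = sum(obj_counts[r].values())
            p_obj[r] = {o: obj_counts[r][o] / total for o in objects}
    return dict(zip(data, posteriors))


if __name__ == "__main__":
    for observation, posterior in run_em(DATA, CANDIDATES).items():
        print(observation, max(posterior, key=posterior.get))
```

Because "via road" can only be PATH, "road" becomes a PATH-typical object, which in turn pushes "by force" and "by mirror" toward AGENT.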

Page 13: Using Ontologies to Enable Access to Multiple Heterogeneous Databases CARDGIS

Dictionary Extraction Results

Ambiguity reduction (Readings / Instances):
60 148 136 124 118 712 810 26 7645 124 203 1082 3101 902

Evaluation for sentence #1: "As a prefix to english words."

0.000000000621871299: NIL relation<abst PHRASAL speech_act

Score: 1/1 = 1

Evaluation for sentence #13: "Gives up to underwriters."

0.000000041080864587: create,make NIL RECIPIENT capitalist<so

0.000000038652300894: transmit_thou NIL RECIPIENT capitalist<so

Score: 1/2 = 0.5

Evaluation for sentence #14: "Gives all claim to the property."

0.000000002594561718: emit,utter human_action PHRASAL possessn>tr

0.000000002564569212: chnge_pos human_action PHRASAL possessn>tr

0.000000002451809783: create,make human_action PHRASAL possessn>tr

0.000000002368122454: cogitate human_action PHRASAL possessn>tr

0.000000002366411877: utilize human_action PHRASAL possessn>tr

0.000000002307022303: transmit_thou human_act PHRASAL possessn>tr

0.000000002177555675: transfer>comm human_act PHRASAL possessn>tr

0.000000002049017956: chnge>go_mad human_act PHRASA possessn>tr

Score: 1/8 = 0.125
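The scores above are consistent with taking the reciprocal of the number of readings that remain for a sentence (1, 2 and 8 readings give 1, 0.5 and 0.125). Under that assumption, which the slide itself does not state, the score and the top-ranked reading could be computed as below; the function names are hypothetical.

```python
def ambiguity_score(readings):
    """Reciprocal of the number of surviving readings (assumed scoring rule)."""
    if not readings:
        raise ValueError("at least one reading is required")
    return 1 / len(readings)


def best_reading(readings):
    """Return the label of the reading with the highest probability."""
    return max(readings, key=lambda pair: pair[0])[1]


if __name__ == "__main__":
    # Sentence #13 from the slide: (probability, head concept of the reading).
    sentence_13 = [
        (0.000000041080864587, "create,make"),
        (0.000000038652300894, "transmit_thou"),
    ]
    print(ambiguity_score(sentence_13), best_reading(sentence_13))
```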

Page 14: Using Ontologies to Enable Access to Multiple Heterogeneous Databases CARDGIS

The Future: Terminology Standard?

Reasons for terminology standardization:
1. Non-duplication
   similar domain models built for many applications
2. Consistency
   across experts within domain, and across domains
3. Efficient model building
   complex: many decisions required simultaneously

ANSI Ad Hoc Group on Ontology Standards (NCITS): draw together Ontology work worldwide
IBM (Santa Teresa), Stanford, ISI, CYC, TextWise, EDR, CSLI, NMSU, Lawrence Livermore, OnTek, Government...

Meetings: 3/96, 9/96, 3/97, 11/97, 1/98, (6/98)…

Page 15: Using Ontologies to Enable Access to Multiple Heterogeneous Databases CARDGIS


Questions?