Maintaining Information Integration Ontologies

Alexandros Valarakos, Georgios Paliouras, Vangelis Karkaletsis, Georgios Sigletos, Georgios Vouros

Software & Knowledge Engineering Lab
Inst. of Informatics & Telecommunications
NCSR “Demokritos”
http://www.iit.demokritos.gr/skel

DCAG, Ulm, December 6, 2003

Structure of the talk

• Information integration in CROSSMARC
• Semi-automated ontology enrichment
• Clustering “synonyms”
• Conclusions

CROSSMARC Objectives

Develop technology for Information Integration that can:

• crawl the Web for interesting Web pages,
• extract information from pages of different sites without a standardized format (structured, semi-structured, free text),
• process Web pages written in several languages,
• be customized semi-automatically to new domains and languages,
• deliver integrated information according to personalized profiles.

CROSSMARC Architecture

[Figure: CROSSMARC architecture, with the Ontology as a labeled component.]

CROSSMARC Ontology

• Meta-conceptual layer
  – Embodies domain-independent semantics
• Conceptual layer
  – Contains relevant concepts of each domain
• Instance layer
  – Contains relevant individuals of each domain
• Lexical layer
  – Language-dependent realizations of domain information

CROSSMARC Ontology

Ontology:

    …
    <description>Laptops</description>
    <features>
      <feature id="OF-d0e5">
        <description>Processor</description>
        <attribute type="basic" id="OA-d0e7">
          <description>Processor Name</description>
          <discrete_set type="open">
            <value id="OV-d0e1041">
              <description>Intel Pentium 3</description>
            </value>
            …

Lexicon:

    <node idref="OV-d0e1041">
      <synonym>Intel Pentium III</synonym>
      <synonym>Pentium III</synonym>
      <synonym>P3</synonym>
      <synonym>PIII</synonym>
    </node>

Greek Lexicon:

    <node idref="OA-d0e7">
      <synonym>Όνομα Επεξεργαστή</synonym>
    </node>
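The link between lexicon and ontology can be illustrated with a small lookup: each lexicon node points at an ontology id through `idref`, so a surface form found on a Web page can be resolved to the ontology value it realizes. The sketch below is only an illustration, not CROSSMARC code. The `<lexicon>` wrapper element and the function name `build_synonym_index` are assumptions introduced here so that the fragment parses as a single document.

```python
import xml.etree.ElementTree as ET

# Lexicon fragment copied from the slide. The <lexicon> wrapper element is an
# assumption, added only so that the fragment is a well-formed document.
LEXICON_XML = """
<lexicon>
  <node idref="OV-d0e1041">
    <synonym>Intel Pentium III</synonym>
    <synonym>Pentium III</synonym>
    <synonym>P3</synonym>
    <synonym>PIII</synonym>
  </node>
</lexicon>
"""


def build_synonym_index(lexicon_xml: str) -> dict:
    """Map each lower-cased synonym to the ontology id it realizes."""
    index = {}
    for node in ET.fromstring(lexicon_xml).iter("node"):
        for synonym in node.iter("synonym"):
            index[synonym.text.strip().lower()] = node.get("idref")
    return index


index = build_synonym_index(LEXICON_XML)
print(index.get("piii"))       # OV-d0e1041
print(index.get("pentium 2"))  # None: this surface form is not in the lexicon
```

A surface form that resolves to no id is exactly the kind of string that ontology enrichment has to deal with: it is either a new instance or an unseen variant of a known one.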

Ontology Enrichment

An ontology captures knowledge statically: it is a snapshot, taken from a particular point of view, of the knowledge that governs a domain of interest during a specific time period.

[Figure labels: Evolving nature of ontology; Ontology Maintenance; Ontology Enrichment, part of Ontology Maintenance; Conceptualization; Instances; T-box; A-box.]

Ontology Enrichment

• Highly evolving domain (e.g. laptop descriptions)
  – New instances characterize new concepts.
    e.g. ‘Pentium 2’ is an instance that denotes a new concept if it does not exist in the ontology.
  – New surface appearance of an instance.
    e.g. ‘PIII’ is a different surface appearance of ‘Intel Pentium 3’.
• We concentrate on instances (knowledge of the domain of interest).
• The poor performance of many Information Integration systems is due to their inability to handle the evolving nature of the domain they cover.

Ontology Enrichment

[Figure: the enrichment cycle. Labels: Multi-Lingual Domain Ontology; Annotating Corpus Using Domain Ontology; Corpus; machine learning; Information extraction; Additional annotations; Validation; Domain Expert; Ontology Enrichment / Population.]

Results: Annotation phase only

[Figure: two charts plotting RECALL and PRECISION (0 to 1) against % of Ontology (25%, 50%, 75%) for three series: Union, Ontology, HMM.]

Results: Full enrichment cycle

25% of the initial ontology

Attribute            Initial Instances   Target Instances   Iter-0   Iter-1   Iter-2
Processor Name       3                   15                 3        4        3
Cdrom Speed          2                   8                  3        3        -
Screen Resolution    2                   7                  0        -        -
RAM                  2                   8                  5        0        -
Processor Speed      4                   12                 7        0        -
HDD                  2                   8                  5        0        -

50% of the initial ontology

Attribute            Initial Instances   Target Instances   Iter-0   Iter-1   Iter-2
Processor Name       6                   15                 3        4        2
Cdrom Speed          5                   8                  3        -        -
Screen Resolution    3                   7                  2        1        1
RAM                  4                   8                  3        0        -
Processor Speed      6                   12                 6        -        -
HDD                  4                   8                  3        0        -

75% of the initial ontology

Attribute            Initial Instances   Target Instances   Iter-0   Iter-1   Iter-2
Processor Name       8                   15                 4        3        -
Cdrom Speed          6                   8                  2        -
Screen Resolution    5                   7                  2        -
RAM                  6                   8                  2        -
Processor Speed      9                   12                 2        0
HDD                  6                   8                  2        -
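One way to read these tables is to add the instances found in each iteration to the initial ones and compare the sum with the target. The short calculation below does this for the table whose Processor Name row starts from 3 initial instances, read here as the 25% run. It rests on two assumptions that the slide does not state: that each Iter column counts newly added instances, and that the initial instances are part of the target set. The names `RUN_25` and `coverage` are hypothetical.

```python
# (initial, target, newly found per iteration), copied from the table whose
# Processor Name row starts at 3 initial instances. "-" cells are omitted.
# Assumption: each Iter value counts instances newly added in that iteration.
RUN_25 = {
    "Processor Name":    (3, 15, [3, 4, 3]),
    "Cdrom Speed":       (2, 8,  [3, 3]),
    "Screen Resolution": (2, 7,  [0]),
    "RAM":               (2, 8,  [5, 0]),
    "Processor Speed":   (4, 12, [7, 0]),
    "HDD":               (2, 8,  [5, 0]),
}


def coverage(initial, target, found):
    """Fraction of the target instances known after the last iteration."""
    return (initial + sum(found)) / target


for name, (initial, target, found) in RUN_25.items():
    known = initial + sum(found)
    print(f"{name}: {known}/{target} = {coverage(initial, target, found):.0%}")
```

Under this reading, Cdrom Speed reaches its full target (2 + 3 + 3 = 8 of 8), while Screen Resolution stays at its 2 initial instances out of 7.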

Enrichment with synonyms

• The number of instances for validation increases with the size of the corpus and the ontology.
• So far, only enrichment with instances that participate in the ‘instance of’ relationship has been supported.
• There is a need for supporting the enrichment of the ‘synonymy’ relationship (in different languages and domains).

We approach this problem using … ONTOLOGY LEARNING

Enrichment with synonyms

• Discover automatically different surface appearances of an instance (CROSSMARC synonymy relationship).
• Issues to be handled:
  – Synonym : ‘Intel pentium 3’ - ‘Intel pIII’
  – Orthographical : ‘Intel p3’ - ‘intell p3’
  – Lexicographical : ‘Hewlett Packard’ - ‘HP’
  – Combination : ‘Intell Pentium 3’ - ‘P III’

Compression-based Clustering

• COCL (COmpression-based CLustering): a model-based algorithm that discovers typographic similarities between strings (sequences of elements, i.e. letters) over an alphabet (ASCII characters), employing a new score function, CCDiff.

• CCDiff is defined as the difference in the code length of a cluster (i.e., of its instances) when a candidate string is added to it. Huffman trees are used as models of the clusters.

• COCL iteratively computes the CCDiff of each new string with respect to each cluster, implementing a hill-climbing search. The new string is added to the closest cluster, or, if its CCDiff exceeds a threshold, a new cluster is created.
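The CCDiff idea can be sketched in a few lines. The version below is one possible reading of the definition, not the authors' implementation: it assumes the cluster model is a Huffman code over the character frequencies of all strings in the cluster, and that the model is rebuilt after the candidate is added. The function names `huffman_code_length` and `ccdiff` are hypothetical.

```python
import heapq
from collections import Counter


def huffman_code_length(strings):
    """Total bits needed to Huffman-encode every character in `strings`."""
    freqs = Counter("".join(strings))
    if len(freqs) < 2:
        # Zero or one distinct symbol: assume one bit per occurrence.
        return sum(freqs.values())
    heap = list(freqs.values())
    heapq.heapify(heap)
    total = 0
    while len(heap) > 1:
        # Merging the two rarest subtrees adds one bit to every symbol in them,
        # so the encoded length grows by the merged weight.
        merged = heapq.heappop(heap) + heapq.heappop(heap)
        total += merged
        heapq.heappush(heap, merged)
    return total


def ccdiff(cluster, candidate):
    """Growth in the cluster's code length when `candidate` joins it."""
    grown = list(cluster) + [candidate]
    return huffman_code_length(grown) - huffman_code_length(cluster)


print(ccdiff(["aabb"], "ab"))  # 2: the candidate reuses the cluster's letters
print(ccdiff(["aabb"], "cd"))  # 8: the new letters force a deeper code tree
```

A candidate that shares its letters with the cluster increases the code length only slightly, which is what makes a small CCDiff a sign of typographic similarity. Note that in this sketch the score also grows with the length of the candidate.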

Compression-based Clustering

Given CLUSTERS and candidate INSTANCES
while INSTANCES is not empty do
    for each instance in INSTANCES
        compute CCDiff for every cluster in CLUSTERS
    end for each
    select instance from INSTANCES that maximizes the difference between its two smallest CCDiff’s
    if min(CCDiff) of instance > threshold
        create new cluster
        assign instance to new cluster
        remove instance from INSTANCES
        calculate code model for the new cluster
        add new cluster to CLUSTERS
    else
        assign instance to cluster of min(CCDiff)
        remove instance from INSTANCES
        recalculate code model for the cluster
    end if
end while
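The loop above can be turned into a short runnable sketch. It is an illustration under stated assumptions rather than the original COCL code: the code length is taken from a character-level Huffman code that is rebuilt on demand (so there is no explicit "recalculate code model" step), an instance facing fewer than two clusters gets a margin of 0, and the threshold value is arbitrary. The names `code_length`, `ccdiff`, and `cocl` are hypothetical.

```python
import heapq
from collections import Counter


def code_length(strings):
    """Bits needed to Huffman-encode all characters in `strings`."""
    freqs = Counter("".join(strings))
    if len(freqs) < 2:
        return sum(freqs.values())  # assumption: one bit per symbol
    heap = list(freqs.values())
    heapq.heapify(heap)
    total = 0
    while len(heap) > 1:
        merged = heapq.heappop(heap) + heapq.heappop(heap)
        total += merged
        heapq.heappush(heap, merged)
    return total


def ccdiff(cluster, candidate):
    return code_length(cluster + [candidate]) - code_length(cluster)


def cocl(clusters, instances, threshold):
    """Assign each instance to its closest cluster or open a new cluster."""
    clusters = [list(c) for c in clusters]
    pending = list(instances)
    while pending:
        best = None  # (margin, instance, sorted (ccdiff, cluster index) pairs)
        for inst in pending:
            scores = sorted((ccdiff(c, inst), i) for i, c in enumerate(clusters))
            # Margin between the two smallest CCDiffs; 0 if there is no second.
            margin = scores[1][0] - scores[0][0] if len(scores) > 1 else 0
            if best is None or margin > best[0]:
                best = (margin, inst, scores)
        _, inst, scores = best
        pending.remove(inst)
        if not scores or scores[0][0] > threshold:
            clusters.append([inst])  # too far from every cluster
        else:
            clusters[scores[0][1]].append(inst)
    return clusters


print(cocl([["aabb"]], ["ab", "cd"], threshold=4))
# [['aabb', 'ab'], ['cd']]
```

Picking the instance with the largest margin first means the least ambiguous assignments are made early, so later decisions are scored against better-populated clusters.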

Results - Evaluation

Dataset characteristics

Cluster’s Name     Cluster’s Type       Instances
Amd                Processor Name       19
Intel              Processor Name       8
Hewlett-Packard    Manufacturer Name    3
Fujitsu-Siemens    Manufacturer Name    5
Windows 98         Operating System     10
Windows 2000       Operating System     3

• Concept Generation Scenario

We incrementally hide one cluster at a time and measure the ability of the algorithm to discover the hidden clusters.

Recall : 100%
Precision : 75%

• Instance Matching Scenario

Instances kept (%)   Correct   Accuracy (%)
90                   3         100
80                   11        100
70                   15        100
60                   19        100
50                   23        95.6
40                   29        96.5
30                   34        94.1

Conclusions

• CROSSMARC is a complete multi-lingual information integration system.
• Ontology Maintenance is crucial in evolving domains.
• Ontology Enrichment helps the adaptation of the system to new domains, saving time and effort.
• Machine-learning based information extraction can assist the discovery of new instances.
• Compression-based clustering discovers string similarities that support the enrichment with different surface appearances of an instance (“synonyms”).
