Annotating Documents for the Semantic Web Using Data-Extraction Ontologies

Dissertation Proposal

Yihong Ding

Motivation

• The representation of web content limits its usability

• A machine-understandable web
– Shared, explicit, formal conceptualizations (ontologies)
– The semantic web

A Problem

• How can the current web be transformed into the semantic web?

A Solution: Semantic Annotation

• Add explicit, formal, and unambiguous metadata to web documents

– Explicit: publicly accessible
– Formal: publicly agreeable
– Unambiguous: publicly identifiable

Annotation Representation

Explicit Annotation

Implicit Annotation

Semantic Annotation: Current Research Status

• Manual annotation through friendly interfaces [Annotea, etc.]

• Automatic annotation with ontology generation [SCORE]

• Automatic annotation using automated IE tools based on predefined ontologies [SemTag, MnM, etc.]

Current Automatic Annotator: A Typical Paradigm

[Diagram components: Domain Ontology; Non-ontology-based IE Wrapper; Rules and extracting categories; Document]

(1) Extraction

(2) Alignment

(3) Annotation
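
The three steps above can be sketched roughly as follows. This is an illustrative Python sketch, not the implementation of any cited system: the wrapper rules, the category names, the alignment table, and the concept names such as `Car.Price` are all hypothetical.

```python
import re

# Hypothetical wrapper rules: each maps an extraction category to a pattern.
# The categories belong to the wrapper, not to any ontology.
WRAPPER_RULES = {
    "price_field": re.compile(r"\$\d+(?:,\d{3})*(?:\.\d{2})?"),
    "year_field": re.compile(r"\b(?:19|20)\d{2}\b"),
}

# Hypothetical alignment table from wrapper categories to ontology concepts.
ALIGNMENT = {
    "price_field": "Car.Price",
    "year_field": "Car.Year",
}


def extract(text):
    """Step 1: run the wrapper rules; return (category, value, offset) triples."""
    hits = []
    for category, pattern in WRAPPER_RULES.items():
        for match in pattern.finditer(text):
            hits.append((category, match.group(), match.start()))
    return sorted(hits, key=lambda hit: hit[2])


def align(hits):
    """Step 2: map wrapper categories to ontology concepts, dropping unmapped ones."""
    return [(ALIGNMENT[c], v, o) for c, v, o in hits if c in ALIGNMENT]


def annotate(aligned):
    """Step 3: emit one annotation record per aligned value."""
    return [{"concept": c, "value": v, "offset": o} for c, v, o in aligned]


document = "2003 Honda Accord, low miles, $8,500"
annotations = annotate(align(extract(document)))
```

Note that the ontology only enters at step 2, after extraction has already happened; this separation is where the alignment heuristics come in.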

Current Automatic Annotator: Problems

[Diagram components: Domain Ontology; Non-ontology-based IE Wrapper; Rules and extracting categories; Document]

(1) Problem of data recognition

(2) Problem of concept disambiguation

(3) Problem of annotation formatting, storing, indexing, sharing

(4) Problem of assembling ontologies

“Main Drawback of Using Automated IE”

[Kiryakov04]

• “none of these approaches expects an input or produces output with respect to ontologies”

• “a set of heuristics for post-processing and mapping of the IE results to an ontology … not sufficient for large-scale, domain-independent semantic annotation.”

• “IE and wrapper induction techniques need to use the ontology more directly during the process of extraction.”

Ontology-driven Paradigm (Data-Extraction Ontology) for Semantic Annotation

[Diagram components: Document; Non-ontology-based IE Wrapper; Ontology-based IE Wrapper; Document]

Ontology-driven Paradigm for Semantic Annotation: Some Arguments

• Resiliency w.r.t. web page layouts (helps scale to large sets of web pages)

• Adaptiveness w.r.t. domain specifications (helps scale to large domains)

• Creation of ontologies: still a problem but no longer a drawback

• Speed of execution: still a drawback (but we propose a solution next)
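
To make the layout-resiliency argument concrete, the sketch below attaches recognizers directly to ontology concepts, so extraction output is already expressed in ontology terms and no separate alignment step is needed. It is a minimal Python illustration under stated assumptions: the `Concept` class, the two car-ad concepts, and the regular expressions are hypothetical, and the tag stripping is deliberately naive.

```python
import re
from dataclasses import dataclass


@dataclass
class Concept:
    """Hypothetical ontology concept that carries its own value recognizer."""
    name: str
    pattern: str  # regular expression describing valid values

    def find(self, text):
        return [m.group() for m in re.finditer(self.pattern, text)]


# A toy extraction ontology: concepts plus recognizers (all assumed).
ONTOLOGY = [
    Concept("Year", r"\b(?:19|20)\d{2}\b"),
    Concept("Price", r"\$\d+(?:,\d{3})*"),
]


def strip_tags(html):
    """Naive tag removal; the recognizers then see text only, not layout."""
    return re.sub(r"<[^>]+>", " ", html)


def annotate(html):
    text = strip_tags(html)
    return {concept.name: concept.find(text) for concept in ONTOLOGY}


# Two different layouts of the same content.
table_layout = "<table><tr><td>2003</td><td>$8,500</td></tr></table>"
list_layout = "<ul><li>Price: $8,500</li><li>Year: 2003</li></ul>"
same_result = annotate(table_layout) == annotate(list_layout)
```

Because the recognizers never look at tags, both layouts produce the same annotations in this toy case.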

Two-Layer Annotation Model

[Diagram components: Conceptual Annotator using an ontology-based IE tool; Document; Structural Annotator; Sample Annotation Process; Similar Documents; Massive Annotation Process]

Structural Annotator

• Major components
– HTML hierarchical path that leads to concept locations
– Local context around locations
– Dependencies among multiple semantic categories

• Significance
– Identify both categories and their semantic meanings
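
As a rough illustration of the first component only (the HTML hierarchical path), the Python sketch below learns which tag path leads to each concept on one sample page and then annotates a similarly structured page by path lookup alone. The function names and the sample pages are hypothetical, the sketch assumes well-formed markup in which every tag is explicitly closed, and it ignores local context and inter-category dependencies.

```python
from html.parser import HTMLParser


class PathCollector(HTMLParser):
    """Collect (path, text) pairs; a path is the chain of open tags with sibling indexes."""

    def __init__(self):
        super().__init__()
        self.stack = []     # open tags, e.g. ["html[1]", "body[1]"]
        self.counts = [{}]  # per-depth counters of child tags seen so far
        self.nodes = []

    def handle_starttag(self, tag, attrs):
        seen = self.counts[-1]
        seen[tag] = seen.get(tag, 0) + 1
        self.stack.append(f"{tag}[{seen[tag]}]")
        self.counts.append({})

    def handle_endtag(self, tag):
        if self.stack:
            self.stack.pop()
            self.counts.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.nodes.append(("/".join(self.stack), text))


def text_paths(html):
    collector = PathCollector()
    collector.feed(html)
    collector.close()
    return collector.nodes


def learn_paths(sample_html, sample_annotations):
    """Map each HTML path to the concept that was annotated there on the sample page."""
    by_value = {value: concept for concept, value in sample_annotations}
    return {p: by_value[t] for p, t in text_paths(sample_html) if t in by_value}


def apply_paths(html, path_to_concept):
    """Annotate a similar page using the learned paths only."""
    return [(path_to_concept[p], t) for p, t in text_paths(html) if p in path_to_concept]


sample = "<html><body><div><b>2003</b><i>$8,500</i></div></body></html>"
similar = "<html><body><div><b>1999</b><i>$3,200</i></div></body></html>"
paths = learn_paths(sample, [("Year", "2003"), ("Price", "$8,500")])
result = apply_paths(similar, paths)
```

Here the sample annotations stand in for the output of the conceptual annotator; the second page is annotated without running any recognizers.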

Ontology Factors in Semantic Annotation Tasks

• Knowledge specification
– Semantic web community
– Web Ontology Language (OWL)

• Knowledge instantiation
– IE and database community
– Object-oriented System Model in XML (OSMX)

Ontology Conversion

• Similarities (OWL vs. OSMX)
– Class vs. object set
– ObjectProperty vs. relationship set
– Cardinality restriction vs. participation constraint
– subClassOf vs. is-a relationship

• Unique features
– OWL
  • subPropertyOf
  • symmetric and transitive properties
  • namespace declaration
  • ontology importing
– OSMX
  • arbitrary n-ary relationship sets
  • data frames
  • general constraints
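
The correspondences above can be written down as a simple lookup table. The Python sketch below is only an illustration of that idea: the dictionary keys and values are informal construct names taken from the list, and `split_convertible` is a hypothetical helper, not part of any OWL or OSMX tool.

```python
# Direct correspondences from the comparison above (OWL construct -> OSMX construct).
OWL_TO_OSMX = {
    "Class": "object set",
    "ObjectProperty": "relationship set",
    "cardinality restriction": "participation constraint",
    "subClassOf": "is-a relationship",
}


def split_convertible(owl_constructs):
    """Partition OWL construct names into (converted, unsupported).

    `converted` maps each convertible construct to its OSMX counterpart;
    `unsupported` lists constructs a converter would have to drop or approximate.
    """
    converted = {}
    unsupported = []
    for construct in owl_constructs:
        if construct in OWL_TO_OSMX:
            converted[construct] = OWL_TO_OSMX[construct]
        else:
            unsupported.append(construct)
    return converted, unsupported


converted, unsupported = split_convertible(["Class", "subClassOf", "subPropertyOf"])
```

Constructs that land in `unsupported` are exactly the "unique features" that make a lossless conversion hard.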

Ontology Construction: An Unavoidable Problem

• Semantic annotation tasks require ontologies.

• An ontology for a specific semantic annotation task is not guaranteed to be available.

Ontology Construction: General and Special

• Generally speaking
– Until now, manual construction has been the mainstream approach
– Automatic and semi-automatic ontology generation: many research papers, few or no practical systems; a very hard problem

• Specific to semantic annotation
– Very dynamic and varied domains
– Much overlapping information
– Limited scope for any one web page
– Flat structure

Ontology Construction: Knowledge Reuse

• “What has been will be again, what has been done will be done again; there is nothing new under the sun.” (The Holy Bible, Ecclesiastes 1:9, NIV translation)

• A “new” ontology is a new assembly built from unions and projections of several pre-existing ontologies.
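
A minimal Python sketch of assembly by projection and union follows. The `Ontology` class here is a toy stand-in (a set of concept names plus labeled binary relationships), and the car and dealer ontologies are invented for illustration; real ontologies also carry constraints and recognizers that this sketch omits.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ontology:
    """Toy ontology: concept names and (source, label, target) relationships."""
    concepts: frozenset
    relationships: frozenset

    def project(self, keep):
        """Keep only the named concepts and the relationships among them."""
        kept = self.concepts & frozenset(keep)
        rels = frozenset(
            r for r in self.relationships if r[0] in kept and r[2] in kept
        )
        return Ontology(kept, rels)

    def union(self, other):
        """Merge two ontologies; concepts with the same name are identified."""
        return Ontology(
            self.concepts | other.concepts,
            self.relationships | other.relationships,
        )


car = Ontology(
    frozenset({"Car", "Make", "Price", "Engine"}),
    frozenset({("Car", "has", "Make"), ("Car", "has", "Price"), ("Car", "has", "Engine")}),
)
dealer = Ontology(
    frozenset({"Dealer", "Phone", "Address", "Car"}),
    frozenset({("Dealer", "sells", "Car"), ("Dealer", "has", "Phone"), ("Dealer", "has", "Address")}),
)

# Assemble a "new" car-ad ontology from pieces of the two existing ones.
car_ad = car.project({"Car", "Make", "Price"}).union(
    dealer.project({"Dealer", "Phone", "Car"})
)
```

Identifying concepts by name alone is a simplifying assumption; in practice, deciding that two concepts from different ontologies are the same is itself a matching problem.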

Architecture on Dynamically Assembling

[Diagram components: Domain of Interest; Web Page; Collection of Knowledge; Selected Knowledge Components; Assembled Ontology]

(1) Knowledge-component selection

(2) Ontology assembly

Thesis Statement

Propose a new solution for performing semantic annotation on ordinary HTML web pages; specifically:

1. apply ontology-based automatic IE techniques

2. augment OWL with a knowledge-recognition extension

3. combine a conceptual annotator and a layout-based annotator

4. dynamically assemble a new domain ontology for each annotation task

Standard Evaluation

• Annotation performance
– Precision
– Recall
– Speed of execution

• Test bed
– 5 to 10 different domains, with over 10 lexical concepts in each domain ontology
– 20 to 50 web pages in each domain
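
For reference, the three measures above can be computed as in the Python sketch below. Representing each annotation as a (concept, value) pair is an assumption made for the example, as are the sample data; `timed` is a hypothetical helper that wraps any annotator function.

```python
import time


def precision_recall(predicted, gold):
    """Compare predicted annotations with hand-labeled ones (iterables of hashable items)."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


def timed(annotator, page):
    """Return the annotator's output and the wall-clock seconds it took."""
    start = time.perf_counter()
    output = annotator(page)
    return output, time.perf_counter() - start


gold = {("Year", "2003"), ("Make", "Honda"), ("Model", "Accord"), ("Price", "$8,500")}
predicted = {("Year", "2003"), ("Make", "Honda"), ("Price", "$500")}
precision, recall = precision_recall(predicted, gold)
```

In this example two of the three predicted annotations are correct, and two of the four gold annotations are found.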

Ontology Converter Test

• A complete and sound check is costly and difficult to implement.

• Our simple test
– Start with an OSMX ontology A
– Convert it to OWL and then transform it back into an OSMX ontology B
– Use both A and B to annotate the same set of web pages (say 30 to 50 web pages)
– Annotation results should be identical
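
The round-trip test described above can be sketched as a small Python harness. Only `round_trip_agrees` reflects the procedure itself; the `toy_*` and `lossy_to_owl` functions are hypothetical stand-ins for a real converter and annotator, and the dictionary-based ontology format is invented for the example.

```python
import re


def round_trip_agrees(ontology_a, to_owl, to_osmx, annotate, pages):
    """Convert A to OWL and back to B; check that A and B annotate every page identically."""
    ontology_b = to_osmx(to_owl(ontology_a))
    return all(annotate(ontology_a, page) == annotate(ontology_b, page) for page in pages)


# Toy stand-ins (assumptions): an "OSMX" ontology is a dict of concept -> regex,
# and its "OWL" form is a list of class records.
def toy_to_owl(osmx):
    return {"classes": [{"id": name, "recognizer": pat} for name, pat in osmx.items()]}


def toy_to_osmx(owl):
    return {c["id"]: c["recognizer"] for c in owl["classes"]}


def lossy_to_owl(osmx):
    """A deliberately broken converter that drops the Price concept."""
    return toy_to_owl({k: v for k, v in osmx.items() if k != "Price"})


def toy_annotate(ontology, page):
    return {name: re.findall(pattern, page) for name, pattern in ontology.items()}


ontology_a = {"Year": r"\b(?:19|20)\d{2}\b", "Price": r"\$\d+(?:,\d{3})*"}
pages = ["2003 Honda Accord $8,500", "1999 Toyota Camry $3,200"]

ok = round_trip_agrees(ontology_a, toy_to_owl, toy_to_osmx, toy_annotate, pages)
broken = round_trip_agrees(ontology_a, lossy_to_owl, toy_to_osmx, toy_annotate, pages)
```

Agreement on the test pages does not prove the converter is sound; it only shows that no difference is observable on those pages, which is the trade-off the simple test accepts.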

Two-Layer Annotation Model Evaluation

• Standard evaluation

• In addition
– About five large web sites with machine-generated web pages, each containing at least dozens of web pages

Dynamic Ontology Assembler Evaluation

• Regular precision and recall study according to selected knowledge components

• A pilot study on when the ontology assembler works better than manual ontology construction
– Record the time needed to create an ontology from scratch with a tool
– Record the time needed to assemble the same ontology
– Compare the differences and the special conditions for each case
– Make empirical suggestions about how to build a knowledge base that favors ontology assembly

Delimitations

• Automatic ontology creation from scratch

• Annotation storing, indexing, and sharing mechanisms

• Semantic annotation for multimedia content

• Parallel or distributed computing to further scale the semantic annotation system to a large number of web pages

Contributions

• Converting current web pages into machine-understandable semantic web pages

• Producing a purely ontology-driven semantic annotator using an ontology-based IE wrapper

• Proposing a novel two-layer annotation model for fast, accurate, and resilient annotation

• Studying a dynamic ontology assembler that helps maximize the reuse of existing knowledge and minimize the load of manual ontology creation

• Implementing an ontology converter so that this work is useful to the rest of the semantic web community
