Databases and Information Systems on the Web (Basi di Dati e Sistemi Informativi su Web)

Prof. Massimo Ruffolo

Ing. Ermelinda Oro

UNICAL - Academic Year 2008-2009

Database and Information System … on Web

The term information system refers to a system of persons, data records and activities that process the data and information in an organization; it includes the organization's manual and automated processes.

A database is a structured collection of records or data that is stored in a computer system. The structure is achieved by organizing the data according to a database model. The model in most common use today is the relational model.

Querying unstructured sources

Structured query over an unstructured document:

Extract/Select/Annotate politicianNews

From http://...

Where politicianNews(X,Y,Z),

Z:politician(name:N),

N=hillaryClinton

[Fill database uri]

This kind of query can be executed over a database or over an unstructured document. Only the rewriting strategy changes.
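To make the idea of "same query, different rewriting strategy" concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `Query` type, the `LEXICON` of surface forms, and the two `run_on_*` functions are hypothetical and are not part of the system described in these slides.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Query:
    concept: str  # e.g. "politician"
    name: str     # e.g. "hillaryClinton"


# Hypothetical lexicon: surface forms under which an object may appear in text.
LEXICON = {"hillaryClinton": ["Hillary Clinton", "Sen. Clinton"]}


def run_on_database(query, rows):
    """Rewriting strategy 1: the query becomes a filter over structured rows."""
    return [row for row in rows if row.get(query.concept) == query.name]


def run_on_text(query, documents):
    """Rewriting strategy 2: the query becomes a pattern match over raw text."""
    forms = LEXICON[query.name]
    pattern = re.compile("|".join(re.escape(form) for form in forms))
    return [doc for doc in documents if pattern.search(doc)]
```

The caller builds one `Query` and hands it to either function; only the way the query is rewritten against the source differs.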

Information extraction and Annotation

Information extraction (IE) acquires information contained in unstructured documents and stores it in structured form.

Turning the current Web into a Semantic Web requires automatic approaches to the annotation of existing data, since manual annotation does not scale in general. More scalable semi-automatic approaches, known from ontology learning, deal with the extraction of ontologies from texts (including texts in tabular form).

An ontology-based system for information extraction from semi-structured and unstructured Web documents

Motivations

Existing IE approaches mainly exploit the syntactic structure of information and not its actual semantics.

Much work on IE from HTML documents:
- There is no single winning approach
- Extraction rules can identify tabular information only when such a structure is explicitly declared
- The variability of the HTML language and the use of Cascading Style Sheets make classic HTML approaches not robust

Too little work on IE from PDF documents:
- No ontology-based approaches
- Existing table recognition and information extraction approaches pursue distinct goals

State of the Art: Existing Approaches and Systems

Manual approaches: TSIMMIS, Minerva, FLORID

Supervised approaches: RAPIER, STALKER, SoftMealy, NoDoSe

Unsupervised approaches: STAVIES, RoadRunner

NLP-oriented systems: RAPIER, TextRunner, SnowBall

PDF-oriented approaches: Flesca et al. 06 (Fuzzy System), Gottlob et al. 06, document understanding techniques

PDF Document: the standard format for document publication, sharing and exchange

IE from Adobe Portable Document Format (PDF):
- One of the most widespread unstructured document formats
- PDF documents are completely unstructured and their internal encoding is visualization-oriented
- The PDF document description language represents a PDF document as a collection of 2-dimensional typographic elements contained in content streams
- Traditional wrapping/IE systems cannot be applied

Information extraction from documents by means of extraction rules that:

i. Exploit a human-oriented document representation: 2-dimensional representation

ii. Exploit semantics of the information represented in a Knowledge Base

iii. Directly Populate (enrich) the Knowledge Base with the Extracted Information

iv. Handle both natural language and document structures (by exploiting embedded Table Recognition Approach)

v. Allow (Semantic) annotation of unstructured sources for enabling semantic classification and search

Proposed Approach

Extraction rules that allow:
- To exploit semantics represented in a Knowledge Base
- To recognize information (whether it is organized in textual or tabular form)
- To directly store extracted information in the Knowledge Base

The extraction rules are built on:
- 2-Dimensional Document Representation
- Attribute Grammars based Patterns
- Knowledge Representation Formalism

2-Dimensional Document Representation: semantics given by the position

[Figure: a table cell whose meaning is given by its position: a value about "Operating revenues", obtained in the year 2007]

Internal Document Representation: Input Document

2-Dimensional Document Representation: Document Portion

[Figure: a document portion located on the page in a coordinate system with origin (0,0), identified by the coordinate pairs (1,32) and (4,33); the example portion contains the text "Tornado Warning"]
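A document portion can be modeled as a small data structure. The sketch below is only an illustration: it assumes that the two coordinate pairs are opposite corners of a rectangle, and the `Portion` class with its `contains` and `same_row` helpers is hypothetical rather than taken from the system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Portion:
    """Rectangular document region between two corner points (assumed layout)."""
    x1: int
    y1: int
    x2: int
    y2: int
    text: str = ""

    def contains(self, other: "Portion") -> bool:
        # True when the other rectangle lies entirely inside this one.
        return (self.x1 <= other.x1 and self.y1 <= other.y1
                and other.x2 <= self.x2 and other.y2 <= self.y2)

    def same_row(self, other: "Portion") -> bool:
        # True when the vertical extents of the two rectangles overlap.
        return self.y1 <= other.y2 and other.y1 <= self.y2


# The page size is made up; the warning portion uses the coordinates above.
page = Portion(0, 0, 10, 40)
warning = Portion(1, 32, 4, 33, "Tornado Warning")
```

Spatial predicates such as `same_row` are what allow a rule to reason about position, for example to relate a value to the label that sits next to it.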

Portioning Process

Attribute Grammars

Example: math expression

E → [+ | −] T [ (+ | −) T ]*

T → F [ (* | /) F]*

F → NUM | (E)

Each symbol of the grammar gets an attribute, and local attributes are used as auxiliary variables (ris = result, segno = sign, oper = operator). The semantic actions then compute the value of the expression:

E → {double E.ris; int segno =1;} [+ | − {segno= −1;} ] T1 {E.ris=segno*T1.ris;}

[ (+ {segno=1;} | − {segno=−1;}) T2 {E.ris=E.ris+segno*T2.ris;} ]*

T → {double T.ris; int oper;} F1 {T.ris=F1.ris;} [ (* {oper=1;} | / {oper=2;} )

F2 {T.ris=(oper==1)?T.ris*F2.ris : T.ris/F2.ris;}]*

F → {double F.ris;} NUM {F.ris=NUM.val;} | (E) {F.ris=E.ris;}
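The attributed grammar above maps naturally onto a recursive-descent evaluator, with one method per non-terminal and the semantic actions written as ordinary statements. The following Python sketch is one possible rendering, not the authors' implementation; the `tokenize`, `Parser` and `evaluate` names are hypothetical, while `ris`, `segno` and `oper` mirror the attributes used in the rules.

```python
import re

TOKEN = re.compile(r"\s*(\d+(?:\.\d+)?|[-+*/()])")


def tokenize(text):
    """Split the input into numbers, operators and parentheses."""
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if not match:
            raise SyntaxError(f"unexpected character at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


class Parser:
    def __init__(self, text):
        self.tokens = tokenize(text)
        self.i = 0

    def peek(self):
        return self.tokens[self.i] if self.i < len(self.tokens) else None

    def take(self):
        token = self.peek()
        self.i += 1
        return token

    def E(self):
        # E -> [+ | -] T [ (+ | -) T ]*
        segno = 1
        if self.peek() in ("+", "-"):
            segno = -1 if self.take() == "-" else 1
        ris = segno * self.T()
        while self.peek() in ("+", "-"):
            segno = -1 if self.take() == "-" else 1
            ris += segno * self.T()
        return ris

    def T(self):
        # T -> F [ (* | /) F ]*
        ris = self.F()
        while self.peek() in ("*", "/"):
            oper = self.take()
            f2 = self.F()
            ris = ris * f2 if oper == "*" else ris / f2
        return ris

    def F(self):
        # F -> NUM | (E)
        token = self.take()
        if token == "(":
            ris = self.E()
            if self.take() != ")":
                raise SyntaxError("expected ')'")
            return ris
        if token is None or not token[0].isdigit():
            raise SyntaxError(f"unexpected token {token!r}")
        return float(token)


def evaluate(text):
    parser = Parser(text)
    ris = parser.E()
    if parser.peek() is not None:
        raise SyntaxError(f"unexpected token {parser.peek()!r}")
    return ris
```

For example, `evaluate("-2 + 3 * (4 - 1)")` applies the sign action of E to the first term and then accumulates the product computed by T.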

Attribute Grammar: $AG = \langle \Pi, V_N, V_T, S, Att, Func, Pred \rangle$

Simple Extraction Patterns: regex

Recognize a float number: \d+(\.\d{2})?

Mail address: ^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$

The word "città" (Italian for "city"), capitalized or not: (C|c)ittà
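The patterns above can be tried directly with Python's `re` module. This is a small sketch under two assumptions that are not in the slides: the mail pattern is compiled with `re.IGNORECASE` (as written it only accepts upper-case letters), and the helper names `is_float`, `is_mail` and `find_citta` are made up for the example.

```python
import re

# The patterns are copied from the slide; only the flags are added here.
FLOAT = re.compile(r"\d+(\.\d{2})?")
MAIL = re.compile(r"^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$", re.IGNORECASE)
CITTA = re.compile(r"(C|c)ittà")


def is_float(text):
    # fullmatch: the whole string must be digits, optionally with two decimals.
    return FLOAT.fullmatch(text) is not None


def is_mail(text):
    # The pattern carries its own ^ and $ anchors.
    return MAIL.match(text) is not None


def find_citta(text):
    # Return every occurrence of the word, in either capitalization.
    return [match.group(0) for match in CITTA.finditer(text)]
```

Note that the float pattern accepts either an integer or a number with exactly two decimal digits, so `is_float("12.5")` is false.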

Ontology: $O = \langle D, A, C, R, I \rangle$

Knowledge Representation Formalism

[Figure: example class tables of the knowledge base]

ID | name

ID | name | population | inState
chicago | “Chicago” | 2833321 | illinois

cityClimate

Self-Describing/Populating Ontology (SDO)

An SDO is an ontology in which objects and classes can be equipped with a set of rules called descriptors.

Descriptors are object-oriented grammatical rules that:
- Allow objects to be recognized in and extracted from documents, and classes to be populated with the newly extracted objects
- Exploit the knowledge contained in the OOKB for the extraction
- Can exploit each other in describing more complex objects

Descriptors

Class Descriptors that handle 2-D capabilities:

class weatherRecord( wCity:city, wWarns:warnings, Temp:temperature, wHumid:percentage, wPress:pressure, wDescr:weatherDescription, wWind:wind).

<weatherRecord(C,Wa,T,H,P,D,Wi)> -> <X:city()>{C:=X;} (<X:warnings()>{Wa:=X;})? <X:temperature()>{T:=X;} <X:percentage()>{H:=X;} <X:pressure()>{P:=X;}
    <X:weatherDescription()>{D:=X;} <X:wind()>{Wi:=X;} 2D-BOTH.
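The way a class descriptor consumes already-recognized objects and fills the attributes of a new instance can be sketched in Python. This is a deliberately simplified, one-dimensional illustration: it ignores the 2D-BOTH directive, it assumes the objects arrive in the attribute order of the class declaration, and the `WEATHER_RECORD` table and `match_descriptor` function are hypothetical names, not part of OntoDLP or of the system.

```python
from typing import Optional

# One entry per grammar symbol: (attribute to fill, expected class, optional?).
# The order follows the attribute order of the class declaration (an assumption).
WEATHER_RECORD = [
    ("wCity", "city", False),
    ("wWarns", "warnings", True),  # mirrors the optional (<X:warnings()>...)? part
    ("Temp", "temperature", False),
    ("wHumid", "percentage", False),
    ("wPress", "pressure", False),
    ("wDescr", "weatherDescription", False),
    ("wWind", "wind", False),
]


def match_descriptor(descriptor, objects) -> Optional[dict]:
    """Match a sequence of (class, value) objects against a descriptor.

    Returns the attribute bindings of the new instance, or None when the
    sequence does not fit the descriptor.
    """
    record, i = {}, 0
    for attribute, cls, optional in descriptor:
        if i < len(objects) and objects[i][0] == cls:
            record[attribute] = objects[i][1]
            i += 1
        elif optional:
            record[attribute] = None
        else:
            return None
    # Every input object must have been consumed.
    return record if i == len(objects) else None
```

A row without a warning still matches because that symbol is optional, whereas a row that lacks the city is rejected.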

General or Domain Specific Knowledge

The system architecture

Attribute Transition Network (ATN) implemented as logic programs in OntoDLP Language

Direct use of Chart Parsing Algorithms for AG parsing

The system architecture: 2-D matcher

Direct use of Chart Parsing Algorithms for AG parsing
