

ISSN: 2010-460X

Proceedings of the 2011 International Conference on Advancements in Information Technology (ICAIT 2011)

Chennai, India, 17-18 December 2011

An Algorithmic Schema Matching Technique for Information Extraction on the Web

Suresh Jain 1, C. S. Bhatia 2

1 KCBTA, Indore

2 SIA, Indore

Abstract. The world today faces the challenge of handling and managing information that is expanding at an exponential rate. Search engines are powerful tools for extracting HTML-based information stored on websites. However, they have limitations in extracting information stored in backend databases, due to schema matching problems or security restrictions. Many problems on the web remain unsolved, preventing searches from being precise and relevant to context. These problems fall into different categories. The first category is technical problems, which are related to data rather than to meaning. The second is semantic problems, which are related to the meaning of data. This paper deals with a solution to these semantic problems with the help of a novel schema matching technique.

    Keywords: XML, RDF, Schema Matching, Global Schema Based Querying (GSQ)

    1. Introduction

The number of information sources and websites on the web is constantly increasing. There are 32 million active websites on the Internet, accessed by one billion Internet users [1, 2]. With so many Web pages, users cannot navigate all of them to gain the information they need, so a system that allows users to query Web pages like a database is becoming increasingly desirable.

Information refers to an idea or fact, and it is used for making decisions [5]. In information science it has been widely accepted that information = data + meaning. Data is often obtained as a result of recording or observation. This huge amount of raw input has no meaning when it exists in that form; for example, the daily temperatures of a place are data. To assemble such data, a system or person monitors the daily temperature and records it. The data is converted into meaningful information only when the patterns in the temperature are analyzed and a conclusion about the temperature is arrived at. Raw data cannot answer questions, whereas information gives answers to questions such as where, what, who, and when. Data is used as input for processing, and the output of this processing is known as information: observation and recording are done to obtain data, while analysis is done to obtain information. From a broader perspective, information extraction is part of the learning process through which humans increase their knowledge (which carries semantic meaning) and wisdom [7]. Fig 1 shows raw data-information-knowledge processing.

    Fig 1: Raw Data-Information-Knowledge processing
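The temperature example can be sketched in a few lines: the raw readings alone answer nothing, while a simple analysis over them yields information (the values and the analysis here are illustrative):

```python
# Raw data: daily temperature readings (meaningless on their own).
readings = [21.0, 23.0, 25.0, 27.0]

# Processing: analyse the recordings to find a pattern.
average = sum(readings) / len(readings)
trend = "rising" if readings[-1] > readings[0] else "falling or flat"

# Information: a meaningful conclusion that can answer questions.
print(f"Average temperature: {average:.1f}, trend: {trend}")
```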

All types of data and information are stored on the Internet. The Internet supports every form of information that can be digitized, implements tools for information exchange such as mail, libraries, and television broadcasts, serves as a medium for business communication, and constantly introduces new forms. Since its commencement, the Internet has overtaken all predecessor information-source technologies and has become a language of communication for the world. Extracting information in such an environment is only possible with the use of information extraction tools. The search engine, the most popular information extraction tool, has become essential for Internet users, although information extraction tools are not perfect and still have many problems to solve. These problems fall into two broad categories.

Technical or syntactic problems: Syntactic problems are related to data rather than to meaning. They address data management problems such as data representation, data storage, and data exchange protocols. Technical problems have been solved through the use of mature technologies, such as databases for storing and querying large volumes of data, and common Internet protocols for data exchange. Thanks to these technologies,


today's Internet users are mostly unaware of technical problems.

Semantic problems: These are related to the meaning of data. Semantic problems occur when there is disagreement about the meaning and interpretation of information, because the computer system that manipulates the information cannot really understand its meaning. This type of problem is known as the semantic gap. Internet users are well aware that semantic problems exist. Attempts to solve semantic problems on the Internet follow two complementary directions:

(1) developing smarter tools, which try to capture and use the meaning of information to answer user queries exactly; and

(2) representing (i.e., storing) and organizing information so that its meaning becomes explicit and machine readable.

To illustrate these approaches, the following sections briefly survey information representations (Sec. 2) and information extraction tools (Sec. 3) on the Internet. Sec. 3.4 proposes global schema based querying as a new way of gathering useful information from the XML portion of the Internet. Definitions of schema matching given by different experts appear in Sec. 4, and schema matching in global schema based querying is presented in Sec. 4.1. Finally, Sec. 5 concludes this paper.

2. Information Representation

HTML: HTML, which stands for HyperText Markup Language, is the language in which web page documents are described. It provides a means to create structured documents by denoting structural semantics for text such as headings, paragraphs, lists, links, quotes, and other items. It allows images and objects to be embedded and can be used to create interactive forms. It is written in the form of HTML elements consisting of "tags" surrounded by angle brackets within the web page content. The purpose of a web browser (like Internet Explorer or Firefox) is to read HTML documents and display them as web pages [5]. The browser does not display the HTML tags, but uses the tags to interpret the content of the page [8].

XML: HTML quickly became a bottleneck in the effort to store and manage huge volumes of data on the Internet. In May 1996, Jon Bosak became the leader of the group responsible for adapting SGML for use on the Internet, and the resulting language became a standard W3C recommendation in February 1998. XML, the Extensible Markup Language, is a set of rules for encoding documents in machine-readable form. It was designed to carry data, not to display it. XML tags are not predefined, so authors must define their own tags. XML documents are then processed further as needed; for example, to display information, technologies such as XSLT or XQuery [5] transform an XML document into a desired HTML document.
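As a small sketch of this idea, the fragment below parses an XML document with user-defined tags and renders it as HTML, standing in for an XSLT transform (the document and its tag names are our own invention):

```python
import xml.etree.ElementTree as ET

# XML carries data in user-defined tags; display markup is produced separately.
doc = """<lib>
  <publication><author>Dickens</author><title>Oliver Twist</title></publication>
</lib>"""

root = ET.fromstring(doc)

# A crude stand-in for an XSLT transform: render each publication as HTML.
rows = [
    f"<li>{p.findtext('title')} by {p.findtext('author')}</li>"
    for p in root.findall("publication")
]
html = "<ul>" + "".join(rows) + "</ul>"
print(html)  # <ul><li>Oliver Twist by Dickens</li></ul>
```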

RDF: In 1998, Tim Berners-Lee conceived the Semantic Web: an information representation technique geared towards enabling fully automatic reasoning over the represented information. The core components of the technique are the Resource Description Framework (RDF) and the Web Ontology Language (OWL). RDF models information as a set of subject-predicate-object triples, where the predicate is a directed relation between two resources: the subject and the object [8].
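The triple model can be illustrated with a minimal in-memory store, where a few hypothetical triples are queried by subject-predicate-object pattern; a real system would use an RDF library rather than plain tuples:

```python
# RDF models information as (subject, predicate, object) triples.
triples = [
    ("OliverTwist", "hasAuthor", "Dickens"),
    ("OliverTwist", "hasType", "Novel"),
    ("Dickens", "bornIn", "Portsmouth"),
]

def match(s=None, p=None, o=None):
    """Return triples matching a pattern; None acts as a wildcard."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

print(match(p="hasAuthor"))  # [('OliverTwist', 'hasAuthor', 'Dickens')]
```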

The three information representation techniques, HTML, XML, and RDF with OWL, presently coexist on the Internet. However, the research in this paper considers only XML as a format for representing information. Furthermore, as both XML and the Semantic Web tackle the problem of semantic heterogeneity, this problem is also considered in this paper.

We envision the Internet as a massive library with no catalog and no support staff. It is not easy to remember the addresses of all required sites and sources, so developers have built a number of search tools and services for this purpose, providing browse/search interfaces to retrieve and access the required information. In this section we first discuss the most general information extraction tools: directories, search engines, meta search engines, and tools for searching for people, and then take a look at research efforts towards more powerful information extraction in an XML environment.

3. Information Extraction Tools

The basic strategy for searching the web is as follows: from the user's point of view, three stages can be distinguished, common to all of the tools mentioned above (see Fig. 2).

Fig 2: Three stages in the information extraction process

User Requesting Information: In this first stage, the user types a query or keyword into the query interface and submits it to the information extraction tool. After receiving the query, the tool searches the Internet to find the information sources that contain the requested information.

Understanding the Information Source: In this stage the information extraction tool has finished the search and has discovered multiple sources containing the requested information. Due to the semantic gap, the tool cannot arrive at the perfect information sources, but it still resolves many problems.

Information Processing: An information source usually contains a massive amount of information. In this stage the user analyzes or filters the whole of it and extracts the exact information from the source.
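The three stages can be sketched as functions over a toy document collection (the "web" dictionary and all of its contents are hypothetical):

```python
# A toy "web" of sources, standing in for indexed web documents.
web = {
    "weather.example": "Chennai temperature today is 31 degrees",
    "news.example": "ICAIT 2011 proceedings published",
}

def request(query):
    """Stage 1: the user submits a query to the tool."""
    return query.lower().split()

def find_sources(terms):
    """Stage 2: locate candidate sources containing any query term."""
    return [url for url, text in web.items()
            if any(t in text.lower() for t in terms)]

def process(terms, sources):
    """Stage 3: filter the sources down to the exact information."""
    return {url: web[url] for url in sources
            if all(t in web[url].lower() for t in terms)}

terms = request("Chennai temperature")
print(process(terms, find_sources(terms)))
```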

Different information extraction tools provide different support to the user in these three stages, and organize the information extraction workflow differently. The first and second stages are common to nearly every information extraction tool; the third stage is supported only in some cases.

    Search Engine

Fig 4: Block diagram of a search engine

A web search engine [11] is designed to search for information on the web (WWW). A search engine is a kind of information retrieval (IR) system [9] which sees the Internet as a document collection. Information extraction with a search engine is illustrated in Fig. 4. In a search engine, the user expresses his information need with a query or keyword K, and the IR system provides the answer by pointing to the documents that are most relevant to the user's query. To help the user understand the proposed documents, the search engine points to the exact pages on which the queried words and phrases appear. A search engine, in general, does not provide support for information processing.

The block diagram of a search engine is determined by two requirements:

1] Effectiveness (quality of results)

2] Efficiency (response time and throughput)
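Effectiveness can be illustrated by ranking documents by how many query terms each one contains. This term-overlap scorer is only a sketch over invented documents; real engines use far richer relevance models:

```python
# Toy document collection (hypothetical contents).
docs = {
    "d1": "schema matching for xml data",
    "d2": "cooking recipes and kitchen tips",
    "d3": "xml schema design",
}

def score(query, text):
    """Relevance as the number of query terms shared with the document."""
    q, t = set(query.lower().split()), set(text.split())
    return len(q & t)

def search(query):
    """Rank documents by score and drop the irrelevant ones."""
    ranked = sorted(docs, key=lambda d: score(query, docs[d]), reverse=True)
    return [d for d in ranked if score(query, docs[d]) > 0]

print(search("xml schema"))  # ['d1', 'd3']
```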

In the case of web directories and meta search engines, the information processing phase is not supported.

The number of XML data sources on the Internet is increasing, and the Internet is converging from an unstructured towards a more structured state. In current research, two approaches are considered for this problem: (1) the user is responsible for resolving representation mismatches between his query and the structure of the data, and (2) the system is responsible for resolving these mismatches. Tools for free querying of structured data on the Internet can be considered both information extraction tools and querying tools; as a result, we use both terms to describe the purpose of these systems. We introduce our technique for structured querying of unidentified XML data: global schema based querying. The technique belongs to the second group, i.e., the system takes the responsibility for resolving representation mismatches, and is similar to the idea of semantic query processing. The following section describes the technique.

4. Global Schema Based Querying (GSQ)

Imagine an information extraction system that permits a user to define his own special intermediate view, called a global schema. Through a global schema, the user expresses his current information need and the structure in which the desired information is expected. For example, for library information, a user might define a global schema similar to the one given in Fig. 5a.

/lib/publication/(author, Dickens)/title

Fig 5a: global schema; Fig 5b: global query

This global schema represents the user's private view of library information; the actual structure of library information on the Internet may be different, but the user need not be aware of it. The user is also allowed to ask queries against his global schema; these queries are called global queries. From the user's point of view, the three stages of information extraction in a GSQ can be distinguished as in Fig. 7. In stage 1, the user submits a query over his global schema. The main responsibility of the GSQ is to find data sources on the Internet by matching this query against their local schemas. The GSQ finds many schema mappings; a mapping might be in the form of a query, or it might be a set of expressions between items in each schema.

    Fig 7: Using GSQ to find information on the XML Web
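One possible form of such a mapping is a set of expressions pairing items of the global schema with items of a source's local schema; the sketch below uses hypothetical paths on both sides:

```python
# A schema mapping as a set of correspondences between global-schema
# items and one source's local-schema items (paths are illustrative).
mapping = {
    "/lib": "/library",
    "/lib/publication": "/library/book",
    "/lib/publication/author": "/library/book/writer",
    "/lib/publication/title": "/library/book/name",
}

def rewrite(global_path):
    """Translate a global-schema path into the corresponding local path."""
    return mapping[global_path]

print(rewrite("/lib/publication/title"))  # /library/book/name
```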

In stage 2, when implemented, a GSQ must have at least two parts: a schema matcher and a query evaluator. The schema matcher is responsible for matching the global schema, supplied by the user, against the schemas of the Internet, which are presumably stored in a large schema repository. The schema matcher delivers a list of possible mappings between the two.
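As one illustrative technique (not necessarily the one used in this paper), a schema matcher can pair elements by name similarity; here difflib's string similarity stands in for a real matching algorithm, and both schemas are hypothetical:

```python
import difflib

# Hypothetical global and local schema element names.
global_schema = ["publication", "author", "title"]
local_schema = ["book", "writer", "titel"]

def match_schemas(gs, ls, threshold=0.3):
    """Pair each global element with its most similar local element,
    keeping only pairs above a similarity threshold."""
    mappings = []
    for g in gs:
        best = max(ls, key=lambda l: difflib.SequenceMatcher(None, g, l).ratio())
        ratio = difflib.SequenceMatcher(None, g, best).ratio()
        if ratio >= threshold:
            mappings.append((g, best, round(ratio, 2)))
    return mappings

# "publication" finds no plausible partner and is dropped; the other two match.
print(match_schemas(global_schema, local_schema))
```

Note that the matcher returns a ranked list of possible mappings rather than a single answer, which is exactly why a later stage must choose and apply one.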

In stage 3, the query evaluator of the GSQ rewrites the global query q into a query over a concrete data source qi and, optionally, transforms the answer ai back into the structure corresponding to that of the global schema. The idea of global schema based querying is not entirely new, however, and shares many properties and problems with the other techniques mentioned above. This paper addresses problems of interest to the broader area of structured querying of distributed XML information sources.
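This stage-3 rewriting can be sketched under a hypothetical mapping and source: the global query /lib/publication/(author, Dickens)/title is evaluated against a source whose local names differ from the global schema's:

```python
import xml.etree.ElementTree as ET

# A hypothetical concrete source whose local schema differs from the
# user's global schema.
source = ET.fromstring(
    "<library><book><writer>Dickens</writer>"
    "<name>Oliver Twist</name></book></library>"
)

# Mapping from global-schema names to this source's local names.
mapping = {"publication": "book", "author": "writer", "title": "name"}

def evaluate(author, want):
    """Evaluate the global query /lib/publication/(author, X)/want,
    rewritten via the mapping into a query over the source."""
    local_path = f"./{mapping['publication']}"
    return [
        b.findtext(mapping[want])
        for b in source.findall(local_path)
        if b.findtext(mapping["author"]) == author
    ]

print(evaluate("Dickens", "title"))  # ['Oliver Twist']
```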