Icait2011 Submission 5
ISSN: 2010-460X
Proceedings of the 2011 International Conference on Advancements in Information Technology (ICAIT 2011)
Chennai, India, 17-18 December 2011
An Algorithmic Schema Matching Technique for Information Extraction on the Web
Suresh Jain 1, C. S. Bhatia 2
1 KCBTA, Indore
2 SIA, Indore
Abstract. The world today faces the challenge of handling and managing information that is expanding at an exponential rate. Search engines are powerful tools for extracting HTML-based information stored on websites. However, search engines are limited in extracting information stored in back-end databases, due to schema matching problems or security restrictions. Many problems on the web remain unsolved, so searches cannot yet be made precise and relevant to context. These problems fall into two categories. The first category is technical: technical problems relate to data rather than to meaning. The second category is semantic: semantic problems relate to the meaning of data. This paper addresses these semantic problems with the help of a novel schema matching technique.
Keywords: XML, RDF, Schema Matching, Global Schema Based Querying (GSQ)
1. Introduction
Nowadays the number of information sources and websites on the web keeps increasing: there are 32 million active websites on the Internet, accessed by one billion Internet users [1, 2]. With so many Web pages, users cannot navigate all of them to find the information they need. A system that allows users to query Web pages like a database is becoming increasingly desirable.
Information refers to an idea or fact, and is used for making decisions [5]. However,
in information science it is widely accepted that information = data + meaning. Data is often obtained as a result of recording or observation. Raw input is data, and in that form it is meaningless. For example, daily temperatures are data: to assemble the data, a system or person monitors the daily temperature and records it; to convert it into meaningful information, the patterns in the temperature are analyzed and a conclusion about the temperature is drawn. Raw data does not answer questions, whereas information answers questions such as where, what, who, and when. Data is used as input for processing, and the output of this processing is known as information. Observation and recording are performed to obtain data, while analysis is performed to obtain information. From a broader perspective, information extraction is part of the learning process through which humans increase their knowledge (which carries semantic meaning) and wisdom [7]. Fig. 1 shows this Raw Data-Information-Knowledge processing.
Fig 1: Raw Data-Information-Knowledge processing
All types of data and information are stored on the Internet. The Internet supports every information form that can be digitized, implements tools for information exchange such as mail, libraries, and television broadcasts, serves as a medium for business communication, and constantly introduces new forms. Since its commencement, the Internet has overtaken all predecessor information technologies and has become a language of communication for the world. Extracting information in such an environment is only possible with information extraction tools. The search engine, the most popular information extraction tool, has become essential for Internet users, although information extraction tools are not perfect and still have many problems to solve.
These problems fall into two broad categories.
Technical or syntactic problems: Syntactic problems relate to data rather than to meaning. They concern data management issues such as data representation, data storage, and data exchange protocols. Technical problems have largely been solved through the use of mature technologies, such as databases for storing and querying large volumes of data and common Internet protocols for data exchange. Thanks to these technologies,
today's Internet users are mostly unaware of technical problems.
Semantic problems: These relate to the meaning of data. Semantic problems occur when there is disagreement about the meaning or interpretation of information, or when the computer system that manipulates the information cannot really understand its meaning. This type of problem is known as the semantic gap. Internet users are well aware that semantic problems exist. Attempts to solve semantic problems on the Internet follow two complementary directions:
(1) Develop smarter tools that try to capture and use the meaning of information and answer user queries exactly.
(2) Represent (i.e., store) and organize information so that its meaning becomes explicit and machine readable.
To illustrate these approaches, the following sections briefly survey information representations (Sec. 2) and information extraction tools (Sec. 3) on the Internet. Sec. 3.4 proposes global schema based querying as a new way of gathering useful information from the XML portion of the Internet. Definitions of schema matching given by different experts appear in Sec. 4, and schema matching in global schema based querying is presented in Sec. 4.1. Finally, Sec. 5 concludes this paper.
2. Information Representation
HTML: HTML, which stands for HyperText Markup Language, is the language in which documents describing web pages are written. It provides a means to create structured documents by denoting structural semantics for text such as headings, paragraphs, lists, links, quotes, and other items. It allows images and objects to be embedded and can be used to create interactive forms. It is written in the form of HTML elements consisting of "tags" surrounded by angle brackets within the web page content. The purpose of a web browser (like Internet Explorer or Firefox) is to read HTML documents and display them as web pages [5]. The browser does not display the HTML tags, but uses the tags to interpret the content of the page [8].
XML: HTML quickly became a bottleneck in the effort to store and manage huge volumes of data on the Internet. In May 1996, Jon Bosak became the leader of the group responsible for adapting SGML for use on the Internet; the resulting Extensible Markup Language (XML) became a standard W3C recommendation in February 1998. XML is a set of rules for encoding documents in machine-readable form. It was designed to carry data, not to display data; XML tags are not predefined, so you must define your own tags. XML documents can then be processed further as needed. For example, to display information, technologies such as XSLT or XQuery [5] transform an XML document into a desired HTML document.
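As a rough illustration of this processing step (not part of the paper's own system), the sketch below parses a small XML fragment with Python's standard xml.etree.ElementTree and emits an HTML list; a real deployment would use XSLT or XQuery as noted above. The element names (lib, publication, title) are assumptions chosen to match the library example used later in this paper.

```python
import xml.etree.ElementTree as ET

# A small XML fragment carrying data (not presentation).
xml_doc = """
<lib>
  <publication><author>Dickens</author><title>Oliver Twist</title></publication>
  <publication><author>Austen</author><title>Emma</title></publication>
</lib>
"""

def xml_to_html(xml_text):
    """Transform the XML document into a simple HTML list of titles."""
    root = ET.fromstring(xml_text)
    items = ["<li>{}</li>".format(pub.findtext("title"))
             for pub in root.findall("publication")]
    return "<ul>" + "".join(items) + "</ul>"

print(xml_to_html(xml_doc))  # -> <ul><li>Oliver Twist</li><li>Emma</li></ul>
```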
RDF: In 1998, Tim Berners-Lee conceived the Semantic Web: an information representation technique geared towards enabling fully automatic reasoning over the represented information. The core components of the technique are the Resource Description Framework (RDF) and the Web Ontology Language (OWL). RDF models information as a set of subject-predicate-object triples, where the predicate is a directed relation between two resources: the subject and the object [8].
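A minimal sketch of the triple data model, using hypothetical resource names (this illustrates the model only, not any RDF library's API):

```python
# Each triple is (subject, predicate, object): a directed, labelled edge.
triples = [
    ("OliverTwist", "writtenBy", "Dickens"),
    ("OliverTwist", "type", "Novel"),
    ("Emma", "writtenBy", "Austen"),
]

def objects(subject, predicate):
    """All objects reachable from `subject` via `predicate`."""
    return [o for s, p, o in triples if s == subject and p == predicate]

print(objects("OliverTwist", "writtenBy"))  # -> ['Dickens']
```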
The three information representation techniques, HTML, XML, and RDF with OWL, presently coexist on the Internet. However, the research in this paper considers only XML as a format for representing information. Furthermore, as both XML and the Semantic Web tackle the problem of semantic heterogeneity, this problem is considered in this paper as well.
We envision the Internet as a massive library with no catalog and no staff to support it. It is not easy to remember the addresses of all required sites and sources, so developers have built a number of search tools and services that provide a browse/search interface to retrieve and access the required information. In this section we first discuss the most general information extraction tools, namely directories, search engines, meta search engines, and tools for extracting people, and then take a look at research efforts towards more powerful information extraction in an XML environment.
3. Information extraction tools
The basic strategy for searching the web, from the user's point of view, can be distinguished into three stages. It is common to all the types of tools mentioned above (see Fig. 2).
Fig 2: Three information stages in the information extraction process
User Requesting Information: In this first stage, the user types the query or keywords into the query interface and submits the query to the information extraction tool. After receiving the query, the tool searches the Internet to find the information sources that contain the requested information.
Understanding the Information Source: In this stage, the information extraction tool has finished the search and has discovered multiple sources containing the requested information. Due to the semantic gap, the information extraction tool cannot identify the perfect information sources, but it still solves many problems.
Information Processing: An information source usually contains a massive amount of information. In this stage, the user analyzes or filters the whole of it to find the exact information within the information source.
Different information extraction tools provide different support to the user in these three stages, and organize the information extraction workflow differently. The first and
second stages are common to nearly every information extraction tool. In some cases, the third stage is supported as well.
Search Engine
Fig. 4 Block diagram of Search Engine
A web search engine [11] is designed to search for information on the web (WWW). A search engine is a kind of information retrieval (IR) system [9] which sees the Internet as a document collection. Information extraction with a search engine is illustrated in Fig. 4. In a search engine, the user expresses his information need with a query K, or keywords, and the IR system provides the answer by pointing to the documents that are most relevant to the user's query. To help the user understand the proposed documents, the search engine takes the user to the exact page on which the words and phrases being looked for appear. A search engine, in general, does not provide support for information processing.
The block diagram of a search engine is determined by two requirements:
1] Effectiveness (quality of results)
2] Efficiency (response time and throughput)
In the case of web directories and meta search engines, the information processing phase is not supported.
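A toy sketch of the IR step described above (not a description of any real search engine's ranking): score each document by how many query keywords it contains, then return the documents most-relevant-first. The document contents and the scoring rule are invented for illustration.

```python
def rank(query, documents):
    """Return documents ordered by keyword overlap with the query."""
    keywords = set(query.lower().split())

    def score(doc):
        # Relevance = number of query keywords present in the document.
        return len(keywords & set(doc.lower().split()))

    return sorted(documents, key=score, reverse=True)

docs = [
    "schema matching for xml data",
    "history of the internet",
    "xml schema repositories on the web",
]
print(rank("xml schema", docs)[0])  # -> schema matching for xml data
```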
The number of XML data sources on the Internet is increasing, and the Internet is converging from an unstructured towards a more structured state. In current research, two approaches to this problem are considered: (1) the user is responsible for resolving representation mismatches between his query and the structure of the data, and (2) the system is responsible for resolving representation mismatches. Tools for free querying of structured data on the Internet can be considered both information extraction and querying tools; as a result, we use both terms to describe the purpose of these systems. We therefore introduce our technique for structured querying of unidentified XML data: global schema based querying. The technique belongs to the second group, i.e., the system takes responsibility for resolving representation mismatches, and is similar to the idea of semantic query processing. The following section describes the technique.
4. Global Schema Based Querying (GSQ)
Imagine an information extraction system that permits a user to define his own special intermediate view, called a global schema. Through a global schema, the user expresses his current information need and the structure in which the desired information is expected. For example, for library information, a user might define a global schema similar to the one given in Fig. 5a and pose queries such as:
/lib/publication/(author, Dickens)/title
Fig 5a: global schema Fig 5b: global query
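Under the assumption that the global query above selects the titles of publications whose author is Dickens, its effect can be sketched with Python's xml.etree.ElementTree and an equivalent XPath-style expression (the XML instance below is an invented example, not from the paper):

```python
import xml.etree.ElementTree as ET

lib = ET.fromstring("""
<lib>
  <publication><author>Dickens</author><title>Oliver Twist</title></publication>
  <publication><author>Austen</author><title>Emma</title></publication>
</lib>
""")

# Equivalent of the global query /lib/publication/(author, Dickens)/title:
# titles of publications whose <author> child has the text 'Dickens'.
titles = [t.text for t in lib.findall("publication[author='Dickens']/title")]
print(titles)  # -> ['Oliver Twist']
```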
These global schemas represent the user's private view of the library information; the actual structure of library information on the Internet may differ, but the user is unaware of this. The user is also allowed to ask queries over his global schema, called global queries. From the user's point of view, the three stages of information extraction in a GSQ can be distinguished as in Fig. 7. In stage 1, the user submits a query over his global schema. The main responsibility of the GSQ is to find data sources on the Internet and match this query with their local schemas. The GSQ finds many schema mappings. A mapping might be in the form of a query, or it might be a set of expressions between items in each schema.
Fig 7: Using GSQ to find information on the XML Web
In stage 2, when implemented, a GSQ must have at least two parts: a schema matcher and a query evaluator. The schema matcher is responsible for matching the global schema, supplied by the user, against the schemas of the Internet, which are presumably stored in a large schema repository. The schema matcher delivers a list of possible mappings between the two.
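A minimal sketch of what such a matcher could do, assuming element-name similarity as the only matching criterion (real matchers also exploit structure, data types, and instances); the schemas and the difflib similarity threshold below are invented for illustration:

```python
from difflib import SequenceMatcher

def match_schemas(global_schema, local_schema, threshold=0.6):
    """Propose mappings between element names whose similarity exceeds threshold."""
    mappings = []
    for g in global_schema:
        for l in local_schema:
            score = SequenceMatcher(None, g.lower(), l.lower()).ratio()
            if score >= threshold:
                mappings.append((g, l, round(score, 2)))
    # Best candidate mappings first.
    return sorted(mappings, key=lambda m: m[2], reverse=True)

global_schema = ["publication", "author", "title"]
local_schema = ["book", "writer", "booktitle", "publications"]
for mapping in match_schemas(global_schema, local_schema):
    print(mapping)
```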
In stage 3, the query evaluator of the GSQ rewrites the global query q into a query over a concrete data source qi and, optionally, transforms the answer ai back into the structure of the global schema. The idea of global schema based querying is not entirely new, however, and shares many properties and problems with the other techniques mentioned above. This paper addresses problems of interest to the broader area of structured querying of distributed XML information sources.
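The rewriting step can be sketched under the assumption that the matcher's output is a simple name-to-name mapping (real rewriting must also handle structural differences and predicates): each global element name in the query path is replaced by its local counterpart.

```python
def rewrite_query(global_query, mapping):
    """Rewrite a '/'-separated global query path using a name mapping."""
    steps = global_query.strip("/").split("/")
    # Names without a mapping entry are kept unchanged.
    return "/" + "/".join(mapping.get(step, step) for step in steps)

# Hypothetical mapping produced by the schema matcher.
mapping = {"lib": "library", "publication": "book", "title": "booktitle"}
print(rewrite_query("/lib/publication/title", mapping))
# -> /library/book/booktitle
```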