

ISSN: 2010-460X

Proceedings of the 2011 International Conference on Advancements in Information Technology (ICAIT 2011)

Chennai, India, 17-18 December 2011

An Algorithmic Schema Matching Technique for Information Extraction on the Web

Suresh Jain 1, C. S. Bhatia 2

1 KCBTA, Indore

2 SIA, Indore

Abstract. The world today faces the challenge of handling and managing information that is expanding at an exponential rate. Search engines are powerful tools for extracting HTML-based information stored on websites. However, they have limitations in extracting information stored in backend databases, due to schema matching problems or security restrictions. Many problems on the web remain unsolved, preventing searches from being precise and relevant to context. These problems fall into different categories. The first category is technical problems, which are related to data rather than to meaning. The second is semantic problems, which are related to the meaning of data. This paper deals with a solution to these semantic problems with the help of a novel schema matching technique.

    Keywords: XML, RDF, Schema Matching, Global Schema Based Querying (GSQ)

    1. Introduction

The number of information sources and websites on the web is constantly increasing. There are 32 million active websites on the Internet, accessed by one billion Internet users [1, 2]. With so many Web pages, users cannot navigate all of them to gain the information they need, so a system that allows users to query Web pages like a database is becoming increasingly desirable.

Information refers to an idea or fact, and it is used for making decisions [5]. In information science it has been widely accepted that information = data + meaning. Data is often obtained as a result of recording or observation. This huge amount of raw input has no meaning when it exists in that form; for example, the daily temperatures of a place are data. To assemble such data, a system or person monitors the daily temperature and records it. The data is converted into meaningful information only when the patterns in the temperature are analyzed and a conclusion about the temperature is arrived at. Raw data cannot answer questions, whereas information gives answers to questions such as where, what, who, and when. Data is used as input for processing, and the output of this processing is known as information: observation and recording are done to obtain data, while analysis is done to obtain information. From a broader perspective, information extraction is part of the learning process through which humans increase their knowledge (which carries semantic meaning) and wisdom [7]. Fig 1 shows raw data-information-knowledge processing.

    Fig 1: Raw Data-Information-Knowledge processing
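The temperature example can be sketched in a few lines: the raw readings alone answer nothing, while a simple analysis over them yields information (the values and the analysis here are illustrative):

```python
# Raw data: daily temperature readings (meaningless on their own).
readings = [21.0, 23.0, 25.0, 27.0]

# Processing: analyse the recordings to find a pattern.
average = sum(readings) / len(readings)
trend = "rising" if readings[-1] > readings[0] else "falling or flat"

# Information: a meaningful conclusion that can answer questions.
print(f"Average temperature: {average:.1f}, trend: {trend}")
```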

All types of data and information are stored on the Internet. The Internet supports every form of information that can be digitized, implements tools for information exchange such as mail, libraries, and television broadcasts, serves as a medium for business communication, and constantly introduces new forms. Since its commencement, the Internet has overtaken all predecessor information-source technologies and has become a language of communication for the world. Extracting information in such an environment is only possible with the use of information extraction tools. The search engine, the most popular information extraction tool, has become essential for Internet users, although information extraction tools are not perfect and still have many problems to solve. These problems fall into two broad categories.

Technical or syntactic problems: Syntactic problems are related to data rather than to meaning. They address data management problems such as data representation, data storage, and data exchange protocols. Technical problems have been solved through the use of mature technologies, such as databases for storing and querying large volumes of data, and common Internet protocols for data exchange. Thanks to these technologies,


today's Internet users are mostly unaware of technical problems.

Semantic problems: These are related to the meaning of data. Semantic problems occur when there is disagreement about the meaning and interpretation of information, because the computer system that manipulates the information cannot really understand its meaning. This type of problem is known as the semantic gap. Internet users are well aware that semantic problems exist. Attempts to solve semantic problems on the Internet follow two complementary directions:

(1) developing smarter tools, which try to capture and use the meaning of information to answer user queries exactly; and

(2) representing (i.e., storing) and organizing information so that its meaning becomes explicit and machine readable.

To illustrate these approaches, the following sections briefly survey information representations (Sec. 2) and information extraction tools (Sec. 3) on the Internet. Sec. 3.4 proposes global schema based querying as a new way of gathering useful information from the XML portion of the Internet. Definitions of schema matching given by different experts appear in Sec. 4, and schema matching in global schema based querying is presented in Sec. 4.1. Finally, Sec. 5 concludes this paper.

2. Information Representation

HTML: HTML, which stands for HyperText Markup Language, is the language in which web page documents are described. It provides a means to create structured documents by denoting structural semantics for text such as headings, paragraphs, lists, links, quotes, and other items. It allows images and objects to be embedded and can be used to create interactive forms. It is written in the form of HTML elements consisting of "tags" surrounded by angle brackets within the web page content. The purpose of a web browser (like Internet Explorer or Firefox) is to read HTML documents and display them as web pages [5]. The browser does not display the HTML tags, but uses the tags to interpret the content of the page [8].

XML: HTML quickly became a bottleneck in the effort to store and manage huge volumes of data on the Internet. In May 1996, Jon Bosak became the leader of the group responsible for adapting SGML for use on the Internet, and the resulting language became a standard W3C recommendation in February 1998. XML, the Extensible Markup Language, is a set of rules for encoding documents in machine-readable form. It was designed to carry data, not to display it. XML tags are not predefined, so authors must define their own tags. XML documents are then processed further as needed; for example, to display information, technologies such as XSLT or XQuery [5] transform an XML document into a desired HTML document.
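As a small sketch of this idea, the fragment below parses an XML document with user-defined tags and renders it as HTML, standing in for an XSLT transform (the document and its tag names are our own invention):

```python
import xml.etree.ElementTree as ET

# XML carries data in user-defined tags; display markup is produced separately.
doc = """<lib>
  <publication><author>Dickens</author><title>Oliver Twist</title></publication>
</lib>"""

root = ET.fromstring(doc)

# A crude stand-in for an XSLT transform: render each publication as HTML.
rows = [
    f"<li>{p.findtext('title')} by {p.findtext('author')}</li>"
    for p in root.findall("publication")
]
html = "<ul>" + "".join(rows) + "</ul>"
print(html)  # <ul><li>Oliver Twist by Dickens</li></ul>
```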

RDF: In 1998, Tim Berners-Lee conceived the Semantic Web: an information representation technique geared towards enabling fully automatic reasoning over the represented information. The core components of the technique are the Resource Description Framework (RDF) and the Web Ontology Language (OWL). RDF models information as a set of subject-predicate-object triples, where the predicate is a directed relation between two resources: the subject and the object [8].
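The triple model can be illustrated with a minimal in-memory store, where a few hypothetical triples are queried by subject-predicate-object pattern; a real system would use an RDF library rather than plain tuples:

```python
# RDF models information as (subject, predicate, object) triples.
triples = [
    ("OliverTwist", "hasAuthor", "Dickens"),
    ("OliverTwist", "hasType", "Novel"),
    ("Dickens", "bornIn", "Portsmouth"),
]

def match(s=None, p=None, o=None):
    """Return triples matching a pattern; None acts as a wildcard."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

print(match(p="hasAuthor"))  # [('OliverTwist', 'hasAuthor', 'Dickens')]
```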

The three information representation techniques, HTML, XML, and RDF with OWL, presently coexist on the Internet. However, the research in this paper considers only XML as a format for representing information. Furthermore, as both XML and the Semantic Web tackle the problem of semantic heterogeneity, this problem is also considered in this paper.

We envision the Internet as a massive library with no catalog and no support staff. It is not easy to remember the addresses of all required sites and sources, so developers have built a number of search tools and services for this purpose, providing browse/search interfaces to retrieve and access the required information. In this section we first discuss the most general information extraction tools: directories, search engines, meta search engines, and tools for searching for people, and then take a look at research efforts towards more powerful information extraction in an XML environment.

3. Information Extraction Tools

The basic strategy for searching the web is as follows: from the user's point of view, three stages can be distinguished, common to all of the tools mentioned above (see Fig. 2).

Fig 2: Three stages in the information extraction process

User Requesting Information: In this first stage, the user types a query or keyword into the query interface and submits it to the information extraction tool. After receiving the query, the tool searches the Internet to find the information sources that contain the requested information.

Understanding the Information Source: In this stage the information extraction tool has finished the search and has discovered multiple sources containing the requested information. Due to the semantic gap, the tool cannot arrive at the perfect information sources, but it still resolves many problems.

Information Processing: An information source usually contains a massive amount of information. In this stage the user analyzes or filters the whole of it and extracts the exact information from the source.
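The three stages can be sketched as functions over a toy document collection (the "web" dictionary and all of its contents are hypothetical):

```python
# A toy "web" of sources, standing in for indexed web documents.
web = {
    "weather.example": "Chennai temperature today is 31 degrees",
    "news.example": "ICAIT 2011 proceedings published",
}

def request(query):
    """Stage 1: the user submits a query to the tool."""
    return query.lower().split()

def find_sources(terms):
    """Stage 2: locate candidate sources containing any query term."""
    return [url for url, text in web.items()
            if any(t in text.lower() for t in terms)]

def process(terms, sources):
    """Stage 3: filter the sources down to the exact information."""
    return {url: web[url] for url in sources
            if all(t in web[url].lower() for t in terms)}

terms = request("Chennai temperature")
print(process(terms, find_sources(terms)))
```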

Different information extraction tools provide different support to the user in these three stages, and organize the information extraction workflow differently. The first and second stages are common to nearly every information extraction tool; the third stage is supported only in some cases.

    Search Engine

Fig 4: Block diagram of a search engine

A web search engine [11] is designed to search for information on the web (WWW). A search engine is a kind of information retrieval (IR) system [9] which sees the Internet as a document collection. Information extraction with a search engine is illustrated in Fig. 4. In a search engine, the user expresses his information need with a query or keyword K, and the IR system provides the answer by pointing to the documents that are most relevant to the user's query. To help the user understand the proposed documents, the search engine points to the exact pages on which the queried words and phrases appear. A search engine, in general, does not provide support for information processing.

The block diagram of a search engine is determined by two requirements:

1] Effectiveness (quality of results)

2] Efficiency (response time and throughput)
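Effectiveness can be illustrated by ranking documents by how many query terms each one contains. This term-overlap scorer is only a sketch over invented documents; real engines use far richer relevance models:

```python
# Toy document collection (hypothetical contents).
docs = {
    "d1": "schema matching for xml data",
    "d2": "cooking recipes and kitchen tips",
    "d3": "xml schema design",
}

def score(query, text):
    """Relevance as the number of query terms shared with the document."""
    q, t = set(query.lower().split()), set(text.split())
    return len(q & t)

def search(query):
    """Rank documents by score and drop the irrelevant ones."""
    ranked = sorted(docs, key=lambda d: score(query, docs[d]), reverse=True)
    return [d for d in ranked if score(query, docs[d]) > 0]

print(search("xml schema"))  # ['d1', 'd3']
```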

In the case of web directories and meta search engines, the information processing phase is not supported.

The number of XML data sources on the Internet is increasing, and the Internet is converging from an unstructured towards a more structured state. In current research, two approaches are considered for this problem: (1) the user is responsible for resolving representation mismatches between his query and the structure of the data, and (2) the system is responsible for resolving these mismatches. Tools for free querying of structured data on the Internet can be considered both information extraction tools and querying tools; as a result, we use both terms to describe the purpose of these systems. We introduce our technique for structured querying of unidentified XML data: global schema based querying. The technique belongs to the second group, i.e., the system takes the responsibility for resolving representation mismatches, and is similar to the idea of semantic query processing. The following section describes the technique.

4. Global Schema Based Querying (GSQ)

Imagine an information extraction system that permits a user to define his own special intermediate view, called a global schema. Through a global schema, the user expresses his current information need and the structure in which the desired information is expected. For example, for library information, a user might define a global schema similar to the one given in Fig. 5a.

/lib/publication/(author, Dickens)/title

Fig 5a: global schema; Fig 5b: global query

This global schema represents the user's private view of library information; the actual structure of library information on the Internet may be different, but the user need not be aware of it. The user is also allowed to ask queries against his global schema; these queries are called global queries. From the user's point of view, the three stages of information extraction in a GSQ can be distinguished as in Fig. 7. In stage 1, the user submits a query over his global schema. The main responsibility of the GSQ is to find data sources on the Internet by matching this query against their local schemas. The GSQ finds many schema mappings; a mapping might be in the form of a query, or it might be a set of expressions between items in each schema.

    Fig 7: Using GSQ to find information on the XML Web
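One possible form of such a mapping is a set of expressions pairing items of the global schema with items of a source's local schema; the sketch below uses hypothetical paths on both sides:

```python
# A schema mapping as a set of correspondences between global-schema
# items and one source's local-schema items (paths are illustrative).
mapping = {
    "/lib": "/library",
    "/lib/publication": "/library/book",
    "/lib/publication/author": "/library/book/writer",
    "/lib/publication/title": "/library/book/name",
}

def rewrite(global_path):
    """Translate a global-schema path into the corresponding local path."""
    return mapping[global_path]

print(rewrite("/lib/publication/title"))  # /library/book/name
```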

In stage 2, when implemented, a GSQ must have at least two parts: a schema matcher and a query evaluator. The schema matcher is responsible for matching the global schema, supplied by the user, against the schemas of the Internet, which are presumably stored in a large schema repository. The schema matcher delivers a list of possible mappings between the two.
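As one illustrative technique (not necessarily the one used in this paper), a schema matcher can pair elements by name similarity; here difflib's string similarity stands in for a real matching algorithm, and both schemas are hypothetical:

```python
import difflib

# Hypothetical global and local schema element names.
global_schema = ["publication", "author", "title"]
local_schema = ["book", "writer", "titel"]

def match_schemas(gs, ls, threshold=0.3):
    """Pair each global element with its most similar local element,
    keeping only pairs above a similarity threshold."""
    mappings = []
    for g in gs:
        best = max(ls, key=lambda l: difflib.SequenceMatcher(None, g, l).ratio())
        ratio = difflib.SequenceMatcher(None, g, best).ratio()
        if ratio >= threshold:
            mappings.append((g, best, round(ratio, 2)))
    return mappings

# "publication" finds no plausible partner and is dropped; the other two match.
print(match_schemas(global_schema, local_schema))
```

Note that the matcher returns a ranked list of possible mappings rather than a single answer, which is exactly why a later stage must choose and apply one.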

In stage 3, the query evaluator of the GSQ rewrites the global query q into a query over a concrete data source qi and, optionally, transforms the answer ai back into the structure corresponding to that of the global schema. The idea of global schema based querying is not entirely new, however, and shares many properties and problems with the other techniques mentioned above. This paper addresses problems of interest to the broader area of structured querying of distributed XML information sources.
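This stage-3 rewriting can be sketched under a hypothetical mapping and source: the global query /lib/publication/(author, Dickens)/title is evaluated against a source whose local names differ from the global schema's:

```python
import xml.etree.ElementTree as ET

# A hypothetical concrete source whose local schema differs from the
# user's global schema.
source = ET.fromstring(
    "<library><book><writer>Dickens</writer>"
    "<name>Oliver Twist</name></book></library>"
)

# Mapping from global-schema names to this source's local names.
mapping = {"publication": "book", "author": "writer", "title": "name"}

def evaluate(author, want):
    """Evaluate the global query /lib/publication/(author, X)/want,
    rewritten via the mapping into a query over the source."""
    local_path = f"./{mapping['publication']}"
    return [
        b.findtext(mapping[want])
        for b in source.findall(local_path)
        if b.findtext(mapping["author"]) == author
    ]

print(evaluate("Dickens", "title"))  # ['Oliver Twist']
```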