crowd aided web search: concept and implementation · crowd search and crowd sourcing is social...

10
Crowd Aided Web Search: Concept and Implementation 1 Crowd Aided Web Search: Concept and Implementation Chun-Hsiung Tseng 1 , Non-member ABSTRACT Although keyword-based search algorithms usually do their jobs well, they may sometimes yield weird results. Despite of the fact that the Web is the largest database, comparing to relational databases, the set of search operations for the Web is still primitive. This paper proposes two ways to remedy this: first, advanced information sources should be created. The role of advanced information sources of the Web is analogous to views of relational databases. Second, we propose several data processing tools based on the concept of advanced information sources. With these two mechanisms, the researcher tries to distinguish the data-centric view from the presentation view of the Web. In the paper, both the concept and an implementation are proposed. Keywords: CrowdIntelligence, CrowdSources, In- formation Extractionn 1. INTRODUCTION The objective of existing search engines is to help Web users locate their needed information accu- rately and efficiently. The Web is an extremely huge database flooded with information. To locate a spe- cific piece of information in the Web is probably more difficult than to locate a needle in a haystack. Today, people usually rely on keyword-based search mecha- nisms. For keyword-based searches to perform well, both computation power of machines and carefully designed algorithms are needed to process such huge amount of information. For example, one of them utilize the combination of map-reduce, page rank, inverted-index, and cloud-based infrastructure and achieve pretty good results in most general cases. However, keyword-based search algorithms may sometimes yield weird results despite the fact that they usually do their jobs well. Why? A possi- ble cause is that modern Web search engines focus on syntactic part rather than semantic part of Web pages. They have no idea about the semantic and context of a submitted query. Generally speaking, what keyword-based search algorithms do is recording Manuscript received on May 6, 2014 ; revised on October 4, 2014. Final manuscript received October 29, 2014. 1 The author is with Department of Information Management, Nanhua University, Taiwan, E-mail: lendle [email protected] the relationships between keywords and documents and simply return lists of relevant documents when related keywords are submitted. In some circum- stances, people are not satisfied with simple keyword- based searches, and some questions can only be an- swered by human beings. For example, the question “where is the best place to go on vacation”, may be overly complex for most modern Web search engines but pretty simple for human beings. Although there are semantic Web related technologies, wide adop- tion is still lacked. Furthermore, instead of simple page lists, sometimes aggregated results are preferred. An example is: people may ask “what is the average housing price in a specific area?” Such questions are also difficult for ordinary Web search engines. The above scenarios show that, in many circum- stances, people expect more from search engines. Simple page lists, the most common type of re- sults produced by modern search engines, are some- times not sufficient. Considering the Web as a huge database, Web pages are actually raw data within the database. The straightforward simple page lists are just unprocessed data. Compared with query tech- nologies for relational database, operators provided by modern Web search engines are very primitive. Powerful query operators can be demanded to solve certain problems. However, complex operators are double-edged swords since they can complicate the search operation. Simplicity is also an important fac- tor of the wide adoption of Web search engine tech- nologies. Is it possible to find a solution to enhance the functionality and still keep simplicity? A pos- sible solution is to keep operators simple but allow data sources to contain richer information. An exam- ple is views of relational databases, which allow pre- processing (or pre-configuration) of the original data but provide the same operators as normal relational tables. With similar concepts, it appears reasonable to improve Web search results with advanced infor- mation sources providing simple operators, that is, we can keep keyword-based interfaces but provide mech- anisms to construct powerful information sources. Definitely, the approach has drawbacks. First, skills and efforts are needed to construct these power- ful information sources. Second, it will be difficult to evaluate the quality of an information source. Third, the performance of Web searches can degrade due to the increased complexity. And last but not the least, frauds can happen since ordinary users may not be able to distinguish between original Web pages and

Upload: others

Post on 14-Aug-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Crowd Aided Web Search: Concept and Implementation · crowd search and crowd sourcing is social bookmark-ing. As shown in Heymanns research work [5], social bookmarking is a recent

Crowd Aided Web Search: Concept and Implementation 1

Crowd Aided Web Search: Concept andImplementation

Chun-Hsiung Tseng1 , Non-member

ABSTRACT

Although keyword-based search algorithms usuallydo their jobs well, they may sometimes yield weirdresults. Despite of the fact that the Web is the largestdatabase, comparing to relational databases, the setof search operations for the Web is still primitive.This paper proposes two ways to remedy this: first,advanced information sources should be created. Therole of advanced information sources of the Web isanalogous to views of relational databases. Second,we propose several data processing tools based on theconcept of advanced information sources. With thesetwo mechanisms, the researcher tries to distinguishthe data-centric view from the presentation view ofthe Web. In the paper, both the concept and animplementation are proposed.

Keywords: CrowdIntelligence, CrowdSources, In-formation Extractionn

1. INTRODUCTION

The objective of existing search engines is to helpWeb users locate their needed information accu-rately and efficiently. The Web is an extremely hugedatabase flooded with information. To locate a spe-cific piece of information in the Web is probably moredifficult than to locate a needle in a haystack. Today,people usually rely on keyword-based search mecha-nisms. For keyword-based searches to perform well,both computation power of machines and carefullydesigned algorithms are needed to process such hugeamount of information. For example, one of themutilize the combination of map-reduce, page rank,inverted-index, and cloud-based infrastructure andachieve pretty good results in most general cases.

However, keyword-based search algorithms maysometimes yield weird results despite the fact thatthey usually do their jobs well. Why? A possi-ble cause is that modern Web search engines focuson syntactic part rather than semantic part of Webpages. They have no idea about the semantic andcontext of a submitted query. Generally speaking,what keyword-based search algorithms do is recording

Manuscript received on May 6, 2014 ; revised on October 4,2014.Final manuscript received October 29, 2014.1 The author is with Department of Information

Management, Nanhua University, Taiwan, E-mail:lendle [email protected]

the relationships between keywords and documentsand simply return lists of relevant documents whenrelated keywords are submitted. In some circum-stances, people are not satisfied with simple keyword-based searches, and some questions can only be an-swered by human beings. For example, the question“where is the best place to go on vacation”, may beoverly complex for most modern Web search enginesbut pretty simple for human beings. Although thereare semantic Web related technologies, wide adop-tion is still lacked. Furthermore, instead of simplepage lists, sometimes aggregated results are preferred.An example is: people may ask “what is the averagehousing price in a specific area?” Such questions arealso difficult for ordinary Web search engines.

The above scenarios show that, in many circum-stances, people expect more from search engines.Simple page lists, the most common type of re-sults produced by modern search engines, are some-times not sufficient. Considering the Web as a hugedatabase, Web pages are actually raw data within thedatabase. The straightforward simple page lists arejust unprocessed data. Compared with query tech-nologies for relational database, operators providedby modern Web search engines are very primitive.Powerful query operators can be demanded to solvecertain problems. However, complex operators aredouble-edged swords since they can complicate thesearch operation. Simplicity is also an important fac-tor of the wide adoption of Web search engine tech-nologies. Is it possible to find a solution to enhancethe functionality and still keep simplicity? A pos-sible solution is to keep operators simple but allowdata sources to contain richer information. An exam-ple is views of relational databases, which allow pre-processing (or pre-configuration) of the original databut provide the same operators as normal relationaltables. With similar concepts, it appears reasonableto improve Web search results with advanced infor-mation sources providing simple operators, that is, wecan keep keyword-based interfaces but provide mech-anisms to construct powerful information sources.

Definitely, the approach has drawbacks. First,skills and efforts are needed to construct these power-ful information sources. Second, it will be difficult toevaluate the quality of an information source. Third,the performance of Web searches can degrade due tothe increased complexity. And last but not the least,frauds can happen since ordinary users may not beable to distinguish between original Web pages and

Page 2: Crowd Aided Web Search: Concept and Implementation · crowd search and crowd sourcing is social bookmark-ing. As shown in Heymanns research work [5], social bookmarking is a recent

2 ECTI TRANSACTIONS ON COMPUTER AND INFORMATION TECHNOLOGY VOL.9, NO.1 May 2015

modified information sources. To remedy these, inthis research, a crowd-based approach is proposed.We leverage the power of crowds in several aspects:

1. sometimes, it can be difficult for inexperiencedWeb users to identify valuable information fromWeb pages; with a good annotation mechanism,the intelligence of crowds can help point out valu-able parts from Web pages and thus can help or-dinary users to acquire the information

2. different users can take different views for even thesame set of Web resources due to their differentbackgrounds; in many situations, these points ofviews can be so diverse and rich that they can beadopted to construct different information sourcesand other users can benefit from them

3. if used adequately, feedbacks generated fromcrowd intelligence can be good tools to againstfrauds

In this paper, at first, the concept of crowd aidedWeb search is presented. Required components areshown along with their definitions and usages. Fur-thermore, implementations of these components arealso included. To prove the concept is practical, aworking example is provided.

2. RELATED WORKS

Although Web search engines today are usuallyconsidered efficient, in some circumstances, they arenot, especially when semantics and human intelli-gence are of concern. Some queries simply cannot beanswered by machines alone. In such cases, humaninput is required [1]. The research field is typicallynamed as crowd search or crowd searching. It is notan easy task to mediate between responses from hu-man beings and search engines, and thus the researchfield is very challenging.

Crowd search is highly related with social network-ing [2]. The opinions collected within friends and ex-pert/local communities can be ultimately helpful forthe search task. For example, the question “find allimages that satisfy a given set of properties” can bedifficult for machines to proceed, but with the help ofhuman intelligence, answering the question becomessimpler [3]. A special query interface that let userspose questions and explore results spanning over mul-tiple sources was proposed in [4]. Another type ofcrowd search and crowd sourcing is social bookmark-ing. As shown in Heymanns research work [5], socialbookmarking is a recent phenomenon which has thepotential to give us a great deal of data about pageson the web.

Various crowd search and crowd sourcing systemshave been proposed. For example, the research ofParameswaran proposed a human intelligence-basedmethodology for solving the human-assisted graphsearch problem [6]. Amazons Amazon MechanicalTurk is a commercial product that enables computerprogrammers (known as Requesters) to co-ordinate

the use of human intelligence to perform tasks thatcomputers are currently unable to do [7].

Considering the Web as a huge database withplenty of information, existing query techniques arefar from perfect. Today, the most widely adoptedquery technique is the keyword-based search. The re-search of Konopnicki and Shmueli [8] examined sometrends in the domain of search, namely the emergenceof system-level search services and of the semanticweb. In the research, a SQLlike solution was sur-veyed. In [4], a YQL (Yahoo! Query Language [9]) isimplemented. The framework consists of some inter-action primitives and is aimed at supporting users infinding responses to multi-domain queries.

In additional to advanced Web query languages,some semantic-based techniques have been proposed.Trillo proposed a set of semantics techniques to groupthe results provided by a traditional search engineinto categories defined by the different meanings ofthe input keywords [10]. In the research, knowledgeprovided by ontologies available on the web was usedto dynamically define the possible categories. To in-fer semantic from Web pages, a frequently used tech-nique is annotations. Chun and Warner proposed se-mantic metadata and annotation of Deep Web Ser-vices (DWS), a reasoning component to assess therelevance of DWS for searching the Deep Web con-tents, using likelihood of occurrence of data sourcesthat contain the query terms, and present a methodof ranking the DWSs [11]. Sarkas, Paparizos, andTsaparas propsed an unsupervised mechanism for as-sessing the likelihood of a structured annotation [12].Sometimes, information sources with semantic infor-mation attached will be regarded to as structured in-formation. The research of Paparizos presented a sys-tem called HELIX for such information sources [13].

3. CROWD AIDED WEB SEARCH

Recall how people look for answers before the In-ternet age. Besides searching in books, newspapers,and magazines, etc., they may consult other peoplefor suggestions and solutions. Usually, the peoplewho are consulted are considered as possessing moreknowledge or experiences about the questions. Ofcourse, the quality of such processed information canvary greatly, but acquiring information in this way istypically efficient if the information sources (the do-main experts and the mediums to carry the informa-tion) are reliable. Today, the amount of informationpresented to users today is much more than the lastdecade. We have search engines, which are good atlocating the (possibly) desired pieces of informationamong Web resources. Standing in front of the in-formation flood, most Web users will at first look tosearch engines for help. Do search engines really rem-edy all the symptoms caused by the explosive amountof information?

Taking a deeper look into how search engines work,

Page 3: Crowd Aided Web Search: Concept and Implementation · crowd search and crowd sourcing is social bookmark-ing. As shown in Heymanns research work [5], social bookmarking is a recent

Crowd Aided Web Search: Concept and Implementation 3

one will find that the answer is probably no. By ex-tracting keywords from Web resources, building in-dexes of keywords, and tracking links between Webresources, search engines are good at handling thesyntactic part and they can locate Web resourcescontaining the given keywords efficiently. However,search engines employing keyword-based approachesactually have no idea about the semantics of theknowledge they present. A common weakness ofgeneric Web search engines is that, with no priori un-derstanding of the materials they processed, they per-form generic search within all Web resources. Hence,Web users without sufficient domain knowledge maynot be able to leverage the full power of search en-gines when they need domain-specific information.For example, for users with little or no IT knowl-edge, distinguishing “apple” that can be eaten fromthe brand “apple” may sometime not be easy. Insuch cases, domain-specific searches, i.e., searches tar-geting information sources of a specific domain, willgive better results than generic searches. Where dothese domain-specific information sources come from?Who is responsible for constructing such informationsources? Apparently, relying on few domain expertswill not work because of the huge amount of existingWeb resources. Utilizing crowd intelligence is a possi-ble cure; however, there may not be sufficient tools forthe adoption of crowd intelligence in the Web searchfield today.

Where do these domain-specific informationsources come from? Who is responsible for construct-ing such information sources? Apparently, relying onfew domain experts will not work because of the hugeamount of existing Web resources. Utilizing crowd in-telligence is a possible cure; however, there may notbe sufficient tools for the adoption of crowd intelli-gence in the Web search field today.

The goal of this research is to propose a platformwhich can simplify the tasks of aggregating the wis-dom of crowds to improve Web search results. Fromthe researchers point of view, there are several keycomponents to complete the platform:

1. data processing tools to be used by users who wantto contribute to the construction of informationsources

2. a specially designed search tool set that redirectsuser queries to suitable information sources builtby other Web users

3. a feedback and analytic system that collects userfeedbacks to adjust the ranking of informationsources

The following figure depicts the usage of the proposedplatform:

Fig.1: Overview of the platform.

In the following sub sections, these components willbe explained in detail.

3.1 Data Processing Tools

An information source stores a set of Web re-sources that are considered to be related to eachother. The framework introduced in this paper willbe usable only if there are information sources. Dataprocessing tools are designed for building user-definedinformation sources. Two types of data processingtools are included in the platform: annotation toolsand aggregating tools. Annotation tools are used foradding meta-information to existing Web resources.Meta-information associates grammars or meaning toWeb resources. As a result, these Web resources canbe further processed by other components. On theother hand, aggregating tools aggregate informationextracted from Web resources. These two types ofdata processing tools will be explained in detail in subsections below. Annotation tools consist of schemasub components, annotation sub components, and ag-gregating tools. They will be described in detail inthe following sub sections:

3.1.1 Schema sub componentsExtracting information from relational databases

is usually much easier than extracting informationfrom Web pages since the formers are structured in-formation sources while the latters are not. Thereare various information extraction technologies thatare designed for the Web. However, most of themsuffered from unstability and limited-adoption issues.To ease the data processing task, the proposed plat-form provides schema sub components that are usedfor designing and maintaining schemas. Here, theresearcher would like to avoid complex schema struc-tures such as XML schema and DTD files. Instead,the researcher proposes an Object-Oriented SchemaModel (OOSM). An OOSM is defined as the follow-ing:

(ROOT ELEMENT, (RULE)*)

That is, an OOSM consists of a root element anda list of rules. A ROOT ELEMENT is representedas a namespace URI and a local element name pair

Page 4: Crowd Aided Web Search: Concept and Implementation · crowd search and crowd sourcing is social bookmark-ing. As shown in Heymanns research work [5], social bookmarking is a recent

4 ECTI TRANSACTIONS ON COMPUTER AND INFORMATION TECHNOLOGY VOL.9, NO.1 May 2015

and is used to group OOSM instances into categories.Furthermore, a RULE is represented as the following:

(ELEMENT, CONSTRUCT1, CONSTRUCT2, , CONSTRUCT n

That is, a RULE consists of a root ELEMENT and aset of unordered schema CONSTRUCTs. Again, anELEMENT is represented by a namespace URI anda local element name. A schema CONSTRUCT isfurther defined as the following:

CONSTRUCT ::= ARRAY(ELEMENT) |ELEMENT

That is, a schema CONSTRUCT is either an array ofinstances of an ELEMENT or an ELEMENT alone.

Keeping simple is essential for OOSM to work.Note that the order of child COSTRUCTs of an EL-EMENT is not strictly enforced. They are just re-garded to as properties of an object according tothe convention of the object-oriented paradigm. Fur-thermore, there is no complex structure defined inOOSM. Different from complex standard such asXML schema, which defines quite a few structures:sequence, choice, and all, etc., there are only twotypes of CONSTRUCTs defined in OOSM: array ofELEMENTs and a single ELEMENT. The reason ofsticking to simplicity is that most Web users are notdomain experts of the grammar-definition field, thuscomplex grammar model may become an obstacle ofwide adoption.

Additionally, in OOSM, the content of an ELE-MENT is not strictly typed. An ELEMENT simplyattaches some form of meaning to the content sur-rounded by it. That is, it is simply a tag. Because ofthe diversity of Web contents, the researcher wouldlike to claim that a strictly typed grammar modelcan be difficult to adopt. On the other hand, an EL-EMENT itself, that is, the namespace URI and localname pair, already provides a good amount of infor-mation for the content associated with it. There aredrawbacks, of course, since post-processing will beneeded to extract information from an ELEMENT.However, an OOSM is just like a relation definition ina relational database and performing post-processingto extract information from a relation definition (forexample, issue a sql query and then post-process re-sults returned from the sql query) is not uncommon.

3.1.2 Annotation sub componentsSchemas must be associated with data to work.

Annotation sub components are designed for specify-ing such associations. As shown above, in OOSM, anELEMENT annotates a portion of a Web page. ForWeb users, the steps of annotating a Web page arethus:

1. choose a ROOT ELEMENT from the schema cate-gory; the ROOT ELEMENT thus identifies a spe-cific schema and represents a specific concept

2. annotate the Web page with ELEMENTs definedin the selected schema according to rules includedin the schema; it is not required to fully anno-tate a Web page, that is, some ELEMENTs canbe skipped, and values of these ELEMENTs aresimply null

3. note that chained annotations can be achievedif an ELEMENT defined in a schema is theROOT ELEMENT of another schema

Compared with existing tagging systems, it seemsas if the proposed annotating mechanism restrictsWeb users freedom of choosing annotations for Webresources. However, the design makes annotation in-formation more usable. By adhering to rules specifiedin the selected schema, annotated Web resources canbe further processed by software modules or domainexperts. Furthermore, despite of the additional con-straints, the simplicity in the schema structure mini-mizes the inconveniences.

The proposed annotation system should performwell if only presentational or nominal information isneeded. However, the system may be insufficient forextraction detailed information. For example, prob-lems can occur if an application has to extract nu-meric information for performing calculation since thecontents of ELEMENTs are not strongly typed. Toremedy this, further postprocessing can be needed.Therefore, aggregating tools are also included in theproposed platform.

3.1.3 Aggregating ToolsAggregating tools support the following opera-

tions: data extraction/normalization, aggregation,join, and projection. Data extraction operations con-struct datacentric views for sets of Web resources.Here, the researcher would like to distinguish be-tween presentation-centric and data-centric view ofWeb resources. Presentation technologies, such asHTML and CSS constitute the presentation-centricview. For ordinary Web users, this is the only viewin their daily Web usages. Presentation-centric viewis appealing to human beings but difficult to processfor software agents. On the other hand, data-centricview is more valuable to software agents since it con-sists of meaningful information extracted from Webpages. In the proposed system, annotation informa-tion contributed by Web users is utilized. Annotationinformation along with the annotated data and theassociated OOSM forms the datacentric view of thespecified set of Web resources.

Data extraction cannot be done without data nor-malization. Data on Web pages can be incomplete,inconsistent, and noisy. For example, to demonstratethe same piece of information, different Web pagescan use different formats or different units of mea-

Page 5: Crowd Aided Web Search: Concept and Implementation · crowd search and crowd sourcing is social bookmark-ing. As shown in Heymanns research work [5], social bookmarking is a recent

Crowd Aided Web Search: Concept and Implementation 5

surement. Even with grammar information, such is-sues can still exist. One solution is to define verydetailed grammar rules. With such rules, users canspecify very detailed information such as formats orunits and thus no ambiguity can happen. However,such type of grammar model will be too complicatedto be used by ordinary Web users, not to mentionwide adoption. To keep the grammar model lean andto prevent ambiguity, using data normalization mech-anisms to fix issues within data on Web pages maybe preferred.

In the proposed project, in addition to the dataextraction layer, a data normalization layer is also in-cluded in the aggregation tool. Web users specify nor-malization rules by writing or reusing simple scripts.Several types of script functions are pre-built includ-ing: object graph traversing functions, tag cleaningfunctions, type conversion functions, and primitivedata manipulation functions.

After data extraction, the original html page willbe converted to a JSON-like object graph. Hence,object graph traversing functions are designed for ac-cessing properties on the graph. A property maystore a DOM node. If only the text portion of DOMnode is needed, tag cleaning functions can be usedto remove all non-text nodes. Type conversion func-tions convert texts to other primitive-typed data. Af-ter that, the data can be processed by primitive datamanipulation functions.

Aggregation operations definitely play an impor-tant role in data processing. These operations be-come available only after data being extracted andnormalized, that is, they can only be applied to data-centric views of Web resources. In the proposedproject, several aggregation operations such as min(),max(), sum(), and avg(), etc. are defined. With theseaggregation functions, Web users can define a set ofrelated Web resources and get further informationfrom them.

Regarding to data-centric views of Web resourcesas tables in relational models, the join operationshould also be available. With the join operation,cross cut analysis among Web resources can be per-formed. Such analysis is valuable if joined informa-tion across Web resources associated with differentgrammars is desired. For example, with the oper-ation, information from news sites and travel agentsites can be integrated together. The join operatorintroduces new grammar models by merging partici-pating models.

Here, the projection operation refers to the actionthe selects a partial set of ELEMENTs defined in thesame schema from different Web resources. The op-eration is required since in some information sources,only part of information defined in the included Webpages will be needed.

3.2 Search Tool

It is without doubt that how to integrate the pro-posed platform with existing search engines can bean issue. In this research, a set of prototyping searchtools are implemented. The tool set includes a searchengine core and a front-end tool. Instead of build-ing a generic Web search engine on our own, we uti-lize the functionalities provided by modern search en-gines. The proposed system works in the followingway:

1. users define advanced information sources andpublish them to the Web; a published advancedinformation source has to declare Web resourcesthat can be processed by it

2. search request redirected to integrated genericWeb search engine; the query will be executed anda list of returned Web pages will be intercepted byour tool

3. urls of returned Web pages will be used to deter-mine the set of advanced information sources thatcan be used; only applicable advanced informationsources should be contacted

4. use user-preference and accumulated scores givenby the crowd to determine the target set of ad-vanced information sources

5. search engine users then have two sets of re-turned results: one from the integrated genericWeb search engine and the other from the ad-vanced information sources selected

6. users utilize the front-end tool to navigate withinthe returned set of results and provide feedback tothe search engine core

The functionalities provided by the search enginecore and the front-end tool will be explained in thefollowing sub sections respectively.

The search engine core handles queries. Due to thespecialty of the proposed platform, the search enginecore has to touch two different sets of informationsources. One is the generic Web and the other is ad-vanced information sources built by crowds. There-fore, for a single query, the search engine core gener-ates two sets of results: generic results and aggregatedresults. The success of existing Web search engines inprocessing generic Web pages makes it unreasonableto re-invent the wheel. As a result, we focus on ag-gregated results while leave generic results to genericWeb search engines, i.e., for the generic part, we sim-ply use the application interfaces provided by genericWeb search engines. In addition to querying againstcrowd-made information sources, the search enginecore should cache the results generated by them formaximizing performance. Aggregated results are or-dered according to the scores recorded in the feed-back system. The order of generic results is basicallypreserved except that Web pages referenced by ag-gregated results will be prioritized and thus will havehigher ranks.

Page 6: Crowd Aided Web Search: Concept and Implementation · crowd search and crowd sourcing is social bookmark-ing. As shown in Heymanns research work [5], social bookmarking is a recent

6 ECTI TRANSACTIONS ON COMPUTER AND INFORMATION TECHNOLOGY VOL.9, NO.1 May 2015

The front-end tool provides several important op-erations. First, users use its user interface to issuequeries. Once query results are returned, besides nav-igating among them, users can change query criterionby modifying keywords or adjusting categories. Oncedifferent keywords are issued, a new query session willbe started. On the other hand, adjusting categoriesresults in re-querying matched user-defined informa-tion sources. Since the search engine core prioritizesgeneric search results referenced by results from user-defined information sources, adjusting categories re-sults in reordering of generic results, too. One thingto be noted is that this step will not result in issu-ing queries to integrated generic Web search engineagain. Furthermore, the front-end tool is also used forgiving feedbacks. There are three types of feedbacks:tagging, category selection, and giving scores. Feed-backs will be used for ordering results returned fromquerying advanced information sources and will indi-rectly affects the order of results from the integratedgeneric Web search engine. The feedback system willbe explained in detail in the next section.

4. SYSTEM DESIGN AND EXAMPLE

For proof of concept, a set of tools has been imple-mented. They can be roughly divided into two cate-gories: the data processing tools and the search tools.The former contains a schema construction tool, theOOSM Builder, and an annotation tool, the OOSMMapper. The latter is mainly a database. In this sec-tion, these tools along with a working example willbe presented.

4.1 Design of Data Processing Tools

The figure below demonstrates our OOSM builderapplication:

Fig.2: OOSM Builder Application.

OOSM Builder is used for designing schemas. Asshown above, the schema in design is organized as alist of rules. In the demonstrated schema, the rootelement, {test}test1, consists of two named rules:{test}root and {test}a. Each rule is then presentedas a list of element/element lists. For example, rule{test}a illustrated in figure 2 contains an element list

that is named as {test}lists1. Note that in OOSM,a list is actually a repeatable, optional group of el-ements. The result schema listed below is stored inthe JSON format:

{“rootElement” : {“name” : {“namespaceURI” : “test”,“localPart” : “root”,“prefix” : “”

}},“rules” : [{“headingElement” : {“name” : {“namespaceURI” : “test”,“localPart” : “root”,“prefix” : “”

}},“constructs” : [{“name” : {“namespaceURI” : “test”,“localPart” : “a”,“prefix” : “”

}}]

},{“headingElement” : {“name” : {“namespaceURI” : “test”,“localPart” : “a”,“prefix” : “”

}},“constructs” : [{“name” : {“namespaceURI” : “test”,“localPart” : “b”,“prefix” : “”}

},{“name” : {“namespaceURI” : “test”,“localPart” : “c”,“prefix” : “”}

},{“elements” : [{“name” : {

Page 7: Crowd Aided Web Search: Concept and Implementation · crowd search and crowd sourcing is social bookmark-ing. As shown in Heymanns research work [5], social bookmarking is a recent

Crowd Aided Web Search: Concept and Implementation 7

“namespaceURI” : “test”,“localPart” : “d”,“prefix” : “”}

},{“name” : {“namespaceURI” : “test”,“localPart” : “e”,“prefix” : “”

}}],“name” : {“namespaceURI” : “test”,“localPart” : “list1”,“prefix” : “”

}}]

}],“name” : {“namespaceURI” : “test”,“localPart” : “test1”,“prefix” : “”},“description” : “description”

}

OOSM Mapper is used for annotating a HTMLsource against a selected schema. The figure belowshows the user interface of OOSM Mapper:

Fig.3: OOSM Mapper Application.

OOSMMapper is designed around the project-centricconcept. A project contains an OOSM file and aHTML document. Users either specify the file pathor the URL of the HTML documents to be anno-tated. If a URL is specified, OOSM Mapper willautomatically fetch the document from the specifiedURL. Note that, a normalization process will be per-formed against the retrieved document to avoid com-mon HTML errors.

To annotate the target HTML document, one sim-ply selects a schema node from the left panel and aHTML node from the right panel, and then triggerthe binding dialog shown in figure 4. In the bindingdialog, there are three input fields: ID, Definition,and Target. The ID field shows the id of the cur-rent binding, which is managed by OOSM Mapperautomatically. The Definition field shows the schemanode used for annotating the selected HTML node.The Target field shows the specification of the anno-tated target node. By default, the Target field showsthe xpath of the selected HTML node. However, userscan enter any valid xpath expression here, includingxpath functions. The design will be useful when it isinadequate to annotate a whole node. For example,users may want to annotate a schema node to only asub string of the selected HTML node.

Fig.4: The Binding Dialog.

Finally, to view the extracted results, one simplyopens the show result dialog.

4.2 Design of Search Tool

A naıve approach to design a search tool willbe simply inserting the extracted data into a rela-tional database. Then, plain SQL queries can beused for searching. However, the dynamic nature ofthe Web makes the design of schemas of relationaldatabases difficult. Additionally, the possibly hugeamount of extracted information can make relationaldatabases too slow to be usable. Furthermore, theextracted result is wrapped into an instance of theelaborate.tag analysis.oosm.instance.binding. Evalu-atedObject class, which in fact contains a map of (key,value) pairs. The EvaluatedObject can in turn gener-ate results in either HTML or JSON format. Hence, amore natural way is to put extracted information intoa NoSQL database and rely on the search capabilitiesprovided by the database. In this research, the Mon-goDB database is chosen as the underlying database.MongoDB accepts JSON-style documents and pro-vides a rich set of query functionalities that can fitour requirements. Last but not the least, the perfor-

Page 8: Crowd Aided Web Search: Concept and Implementation · crowd search and crowd sourcing is social bookmark-ing. As shown in Heymanns research work [5], social bookmarking is a recent

8 ECTI TRANSACTIONS ON COMPUTER AND INFORMATION TECHNOLOGY VOL.9, NO.1 May 2015

mance of MongoDB is good and it supports replicas,which is essential for dealing with a very large amountof data.

4.3 Example

Here, a working example is used to fur-ther clarify our concept. The example isbased on the smartphone ontology obtained fromhttp://www.productontology.org/. The originallyschema is in RDF format, which is more complexthan OOSM, so we simplified the schema by removingthe relation information in the schema and convertedit into OOSM format. Then, the Wikipedia page foriphone is chosen as the target document. The schemais shown below:

Fig.5: The (Modified) SmartPhone Schema.

As shown above, reviews and features are designedas element lists since there can be more than one ofthem. Then, we designed the bindings. Since the textnodes of the Wikipedia page are usually long para-graphs, we have to use the substring xpath functionto extract needed information from the page.

4.4 Search Example

For concept-proof, a sample search engine thatis shipped with the smartphone schema presentedabove has been built. In the current stage, thefunctionality of selecting the desired informationsource is not implemented, so users have to selecttheir target information sources manually. The cur-rent implementation supports both full-text queriesand attribute-oriented queries. The search engineis hosted by our own Web server and the url is“http://ilab1.twgogo.org:8080/OOSMWebSearchCli-ent/s martphone/index.jsp. The figure below showsthe homepage:

When a full-text query like “iphone is received,our search engine simply looks into the “descriptionelement of the smartphone schema. On the otherhand, we rely on the underlying mongo database forattribute queries. With schema information, userscan issue powerful queries such as:

Fig.6: The OOSM Smartphone Search Page.

{“{oosmtest}weight” : {$lt : 130}}

The above query can be used for looking for asmart phone with weight less than 130 gram. Theresult is shown below:

Fig.7: Result of the Sample Query.

Furthermore, clicking the “See Detail link will dis-play the original JSON document of the selected en-try. Compared with the google search engine, ben-efiting from the schema information, the proposedmethod not only provides richer query functional-ities through attributeoriented queries but also re-turns more precise information when the target do-main is known in advance.

5. DISCUSSIONS

5.1 Integration with Existing Search Engines

Today, there are already several big companiesdominating the search engine market. Re-inventingthe wheel, that is, building our own search enginesis not practical. Instead, the goal of this researchis to complement existing search engines. The coreconcept of this research is that advanced informa-tion sources, information sources made by domain ex-perts, are helpful for domain-specific search. The pro-vided tools simplify the construction of advanced in-formation sources. When the target domain is knownclearly, users simply point to the desired informationsources managed by our system for querying. Thenext goal is to build a recommendation system forappropriate information sources. When users submit

Page 9: Crowd Aided Web Search: Concept and Implementation · crowd search and crowd sourcing is social bookmark-ing. As shown in Heymanns research work [5], social bookmarking is a recent

Crowd Aided Web Search: Concept and Implementation 9

queries, the recommendation system extracts labelsfrom result lists returned by search engines and uselabels for suggesting suitable information sources.

5.2 Feedback and Analytic System

Feedbacks from users are used for determining theordering of search results. The current implementa-tion provides three types of feedbacks to interact withend users: tagging, category-re-selection, and scor-ing. Tagging reflects what users think an aggregateor generic result is about. By collecting tags providedby users and performing analysis, possible categoriza-tions of Web resources can be inferred. Tagging hastwo different types of effects: direct and indirect. Forordinary users, assigning tags to plain Web pages isthe natural behavior. The direct effect of tag assign-ing is to determine which categories a Web page be-longs to. However, since assigning new tags to a Webpage may change the current categorization results,tag assigning also has indirect effect on informationsources containing the Web page.

A problem of the auto-categorizing system men-tioned above is that it is very difficult if not impossi-ble to make the result accurate. As a result, allowingevolving of categorizes becomes an important factorof success of categorization systems. The indirect ef-fect mentioned above forms the category-re-selectionprocess and can be viewed as a complementary mech-anism. Once a user decides to switch to a differentcategory, the action reveals that the user does notagree with the keyword-category association recordedin the system. Hence, the information should beused to adjust the automatically generated catego-rizations. This complementary system will be benefi-cial to the categorization accuracy, but the adoptionof the information should be conservative since be-haviors of Web users can be very diverse.

Additionally, a straightforward scoring system is alsoimplemented. The scoring system is only applied touserdefined advanced information sources since wehave no intention to mess up with generic Web searchmechanisms. However, as stated in former sections,the ordering of aggregated results will affect the or-dering of generic results, only that such ordering isjust presentational and the internal ordering recordedby the integrated Web search engine will be kept un-touched.

5.3 DOM Modification Problem

We can roughly categorize annotation systems intotwo types: internal annotation system and externalannotation system. The former requires annotationsto be directly embedded into Web pages. Such an-notation systems are more stable since whenever aWeb page is modified, annotations associated withit can also be adjusted to fit the modification. How-ever, internal annotation system can be inconvenient.

The major problem is it affects the way ordinary Webusers authoring Web pages. On the other hand, thelatter approach is usually more adoptable. With ex-ternal annotation systems, Web pages themselves arekept untouched. Besides, another benefit of exter-nal annotation systems is they allow multiple sets ofannotations to be attached to the same Web page.

However, such annotation systems typically sufferfrom the DOM modification problem. That is, oncethe structure of a Web page is changed, the appliedannotations can become invalid since usually a DOMpath is used for associating an annotation to a piece ofthe Web page. Web pages are modified frequently es-pecially for those dynamically generated pages. Thesituation makes it really difficult to develop and main-tain external annotation systems.

In the proposed mechanism, a special schema sys-tem named as OOSM is adopted. By creating map-pings between an OOSM and a Web page, the an-notation process is accomplished. The mapping pro-cess does not require positioning annotations accu-rately. The nature of OOSM makes it easier to dealwith DOM modification since it allows annotatinga sub tree in a DOM. The design can cause prob-lems since it produces weaklytyped results. Hence,to make OOSM effective, data processing tools men-tioned above is required. Compared with the DOMmodification problem, the researchers would like toclaim that OOSM is still valuable since the weakly-types issue can be remedied with scripting function-alities.

6. CONCLUSIONS AND FUTURE WORK

In this paper, a Web search mechanism elaborat-ing crowd wisdom is proposed. Our approach pro-vides a way for users to define advanced informa-tion sources, which are just like views of relationaldatabases. Based on advanced information sources,some more feature-rich search operations are also pro-vided. The researcher claims that the proposed mech-anism is more efficient and powerful than ordinarykeyword-based search. The current implementationsare mostly desktop-based GUI applications. In thefuture, it is planned to port the current implemen-tations to be Web-based for easier integration andadoption.

ACKNOWLEDGEMENT

First, I have to acknowledge NSCs support for thecompletion of this research. Furthermore, the paperwas prepared in collaboration with my lab membersin Department of Information Management, NanhuaUniversity, Taiwan. I would like to acknowledge thefollowing persons who have made the completion ofthis research possible: Chu-Chun Chuang, Jia-HuaWu, Han-Ci Syu, and Yan-Ru Jiang.

Page 10: Crowd Aided Web Search: Concept and Implementation · crowd search and crowd sourcing is social bookmark-ing. As shown in Heymanns research work [5], social bookmarking is a recent

10 ECTI TRANSACTIONS ON COMPUTER AND INFORMATION TECHNOLOGY VOL.9, NO.1 May 2015

References

[1] J. Franklin, D. Kossmann, T. Kraska, S. Rameshand R. Xin, “CrowdDB: answering queries withcrowdsourcing,” In Proceedings of the 2011 ACMSIGMOD International Conference on Manage-ment of data,pp. 61-72, 2011.

[2] A. Bozzon, M. Brambilla and S. Ceri, “Answer-ing search queries with CrowdSearcher,” In Pro-ceedings of the 21st international conference onWorld Wide Web, pp. 1009-1018, 2012.

[3] A. Parameswaran, H. Garcia-Molina, H. Park,N. Polyzotis, A. Ramesh and J. Widom, “Crowd-Screen: algorithms for filtering data with hu-mans,” In Proceedings of the 2012 ACM SIG-MOD International Conference on Managementof Data, pp. 361-372, 2012.

[4] A. Bozzon, M. Brambilla, S. Ceri and P. Frater-nali, “Liquid query: multi-domain exploratorysearch on the web,” Proceedings of the 19th in-ternational conference on World wide web, pp.161-170, 2010.

[5] P. Heymann, G. Koutrika and H. Garcia, “Cansocial bookmarking improve web search?,” InProceedings of the 2008 International Conferenceon Web Search and Data Mining, pp. 195-206,2008.

[6] A. Parameswaran, A. Sarma, H. Garcia-Molina,N. Polyzotis, and J. Widom, “Human-assistedgraph search: it’s okay to ask questions,” TheProceedings of the VLDB Endowment,vol.4, pp.267-278, 2011.

[7] Amazon, https://www.mturk.com/mturk/welcome

[8] D. Konopnicki and O. Shmueli, “Database-inspired search,” In Proceedings of the 31st in-ternational conference on Very large data bases,pp. 2-12, 2005.

[9] Yahoo, Yahoo! Query Language,http://developer.yahoo.com/yql/

[10] R. Trillo, L. Po, S. Ilarri, S. Bergamaschi andE. Mena, “Using semantic techniques to accessweb data,” Information Systems 36, pp. 117-133,2011.

[11] S. Chun and J. Warner, “Semantic Annotationand Search for Deep Web Services,” In Proceed-ings of the 2008 10th IEEE Conference on E-Commerce Technology and the Fifth IEEE Con-ference on Enterprise Computing, E-Commerceand EServices, pp. 389-395, 2008.

[12] N. Sarkas, S. Paparizos and P. Tsaparas, “Struc-tured annotations of web queries,” In Proceed-ings of the 2010 ACM SIGMOD InternationalConference on Management of data, pp. 771-782,2010.

[13] S. Paparizos, A. Ntoulas, J. Shafer and R.Agrawal, “Answering web queries using struc-tured data sources,” In Proceedings of the 2009

ACM SIGMOD International Conference onManagement of data, pp. 1127-1130, 2009.

Chun-Hsiung Tseng received his B.S.in computer science from National Na-tional ChengChi University, and re-ceived both M.S. and Ph.D. in computerscience from National Taiwan Univer-sity. He was a research assistant of Insti-tute of Information Science, AcademiaSinica in 2003-2010. He was a fac-ulty member of Department of Com-puter Information and Network Engi-neering, Lunghwa University of Science

and Technology in 2010-2013. His current position is a facultymember of Department of Information Management, NanhuaUniversity. His research interests include big data analysis,crowd intelligence, e-learning systems, and Web informationextraction.