Scalable Hybrid Keyword Search on Distributed Database Jungkee Kim Florida State University Community Grids Laboratory, Indiana University Workshop on Autonomic Distributed Data and Storage Systems Management (ADSM 2005)


Page 1: Scalable Hybrid Keyword Search on Distributed Database

Scalable Hybrid Keyword Search on Distributed Database

Jungkee Kim
Florida State University
Community Grids Laboratory, Indiana University

Workshop on Autonomic Distributed Data and Storage Systems Management (ADSM 2005)

Page 2: Scalable Hybrid Keyword Search on Distributed Database

Motivation

Internet

Where is the Information?

Page 3: Scalable Hybrid Keyword Search on Distributed Database

Outline

Two Typical Search Paradigms
Problems of Current Search Approaches
Local Hybrid Keyword Search
Hybrid Search on Distributed Databases

Page 4: Scalable Hybrid Keyword Search on Distributed Database

Two Typical Search Paradigms

Searching over structured data

Relational Databases

Searching over unstructured data

Information Retrieval

Internet Environment

Semistructured Data – XML

Keyword Search in DB

Web Search Engines – Technologies from Information Retrieval

Hybrid Keyword Search?

Page 5: Scalable Hybrid Keyword Search on Distributed Database

Current Approaches – Keyword-only Search

Web Search Engines
    Web crawlers visit Web pages and collect keyword-based text indexes.
    Fast information retrieval
Keyword Search in Databases
    Web integration on legacy DBMS
    Dynamic Web publication through embedded DB
    Easy to use without knowledge of the DB schema

Page 6: Scalable Hybrid Keyword Search on Distributed Database

Problems of Current Approaches – Keyword-based

Web Search Engines
    Cannot collect every connected resource
    Query results are often unrelated
Keyword Search in Databases
    Loses the inherent meaning of the schema
    Query results are not based on the semantic schema

Page 7: Scalable Hybrid Keyword Search on Distributed Database

Current Approaches – Semantic

Semantic Web
    Multiple relation links expressed as directed labeled graphs, so that machines can understand the relationships between different resources
    Describes metadata about resources
    Represents the relations of objects on the Web; the object terms are defined under a specific description, an ontology

Page 8: Scalable Hybrid Keyword Search on Distributed Database

Problems of Current Approaches – Semantic Web

Ontology design is sophisticated
Lack of unified definition
Limited adoption

Page 9: Scalable Hybrid Keyword Search on Distributed Database

Our Approach

Hybrid search mechanisms: semantic metadata + keyword search
Semantic Solution
    Semantic Web might be better than Hybrid search
    Hybrid search must be better than Web search engines
Simplicity
    Hybrid search is simpler than Semantic Web

Page 10: Scalable Hybrid Keyword Search on Distributed Database

Hybrid Keyword Search Service

A search service fetches target information data against a search query.

Unstructured data: a file containing data – MS Word, PDF, PS documents

Metadata: Structured or semistructured data – XML

We utilized an XML-enabled relational DBMS and a native XML DB along with a text search library (Apache Xindice + Jakarta Lucene) to address the search against metadata and text.

Page 11: Scalable Hybrid Keyword Search on Distributed Database

How to Combine? (1)

Two entity sets and a relationship in relational DBMS

We can obtain the hybrid search result using a nested subquery
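
The nested-subquery combination can be sketched with an in-memory SQLite database. This is only an illustration: SQLite stands in for the XML-enabled relational DBMS, and the table and column names (metadata, doc_terms, describes, meta_id, doc_id) are hypothetical, not the schema used in the talk. The two entity sets are the metadata records and the document terms; the describes table is the relationship between them.

```python
import sqlite3


def build_demo_db():
    """Create a tiny in-memory database with two entity sets and a relationship."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE metadata  (meta_id INTEGER PRIMARY KEY, author TEXT, year INTEGER);
        CREATE TABLE doc_terms (doc_id INTEGER, term TEXT);
        CREATE TABLE describes (meta_id INTEGER, doc_id INTEGER);
    """)
    conn.executemany("INSERT INTO metadata VALUES (?, ?, ?)",
                     [(1, "Kim", 2005), (2, "Lee", 2004), (3, "Park", 2005)])
    conn.executemany("INSERT INTO doc_terms VALUES (?, ?)",
                     [(10, "grid"), (10, "search"), (20, "grid"), (30, "xml")])
    conn.executemany("INSERT INTO describes VALUES (?, ?)",
                     [(1, 10), (2, 20), (3, 30)])
    return conn


def hybrid_search(conn, year, keyword):
    """Return ids of documents that contain `keyword` and whose metadata has `year`."""
    sql = """
        SELECT DISTINCT d.doc_id
        FROM doc_terms AS d
        WHERE d.term = ?
          AND d.doc_id IN (              -- nested subquery over the relationship
              SELECT r.doc_id
              FROM describes AS r
              WHERE r.meta_id IN (       -- nested subquery over the metadata
                  SELECT m.meta_id FROM metadata AS m WHERE m.year = ?
              )
          )
        ORDER BY d.doc_id
    """
    return [row[0] for row in conn.execute(sql, (keyword, year))]
```

Calling hybrid_search(build_demo_db(), 2005, "grid") keeps only the documents that satisfy both the metadata condition and the keyword condition.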

Page 12: Scalable Hybrid Keyword Search on Distributed Database

How to Combine? (2)

A hash table is used for joining search results in the non-DBMS-based system (Apache Xindice + Lucene)
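
A hash-table join of the two result sets might look like the following minimal sketch. The function name hash_join and the shape of the inputs (pairs keyed by a document id) are assumptions made for illustration; the actual Xindice and Lucene calls that would produce these lists are not shown.

```python
def hash_join(metadata_hits, keyword_hits):
    """Join two result lists on document id using a hash table (a dict).

    metadata_hits: iterable of (doc_id, metadata) pairs from the XML query.
    keyword_hits:  iterable of (doc_id, score) pairs from the text search.
    """
    # Build phase: index the metadata hits by document id.
    table = {}
    for doc_id, meta in metadata_hits:
        table[doc_id] = meta

    # Probe phase: keep only keyword hits whose id is also in the table.
    joined = []
    for doc_id, score in keyword_hits:
        if doc_id in table:
            joined.append((doc_id, table[doc_id], score))
    return joined


# Hypothetical result sets standing in for the two search back ends.
metadata_hits = [("d1", {"year": 2005}), ("d3", {"year": 2005})]
keyword_hits = [("d3", 0.9), ("d2", 0.7), ("d1", 0.4)]
result = hash_join(metadata_hits, keyword_hits)
```

Each input is scanned once, so the expected cost grows with the sum of the two result sizes rather than their product.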

Page 13: Scalable Hybrid Keyword Search on Distributed Database

Local Query Processing – XML (1)

XML-enabled RDB
DBLP XML records (1,000 – 10,000)
Non-indexed matches, except the year match, are bound by the number of matches.
Combined query time depends on the number of year query results.
Figure: Average XML Query Time

Page 14: Scalable Hybrid Keyword Search on Distributed Database

Local Query Processing – XML (2)

Apache Xindice
DBLP XML records (1,000 – 10,000)
Indexed approximate matches on text elements in XML instances perform as badly as non-indexed queries.
Exact matches are bound by the number of matches.
Figure: Average XML Query Time

Page 15: Scalable Hybrid Keyword Search on Distributed Database

Local Query Processing – Hybrid (1)

Hybrid search query performance measurement
XML-enabled RDB
For 100,000 XML instances and 100,000 text documents
Small result set: 4 XML matches and a keyword match
Large result set: 7,752 XML and 41,889 documents (3,227)

Metadata         Author       Year (nested subquery)    Year (hash table)
Few keywords     0.04 sec.    82.9 sec.                 5.70 sec.
Many keywords    0.48 sec.    half hour                 6.96 sec.

Page 16: Scalable Hybrid Keyword Search on Distributed Database

Local Query Processing – Hybrid (2)

Hybrid search query performance measurement
Apache Xindice + Jakarta Lucene
For 10,000 XML instances and 10,000 text documents
Small result set: 2 XML matches and a keyword match
Large result set: 192 XML and 4,562 documents (41)

Page 17: Scalable Hybrid Keyword Search on Distributed Database

Discussion – Local Hybrid Search

The XML-enabled RDB provides adequate response times except under some extreme query loads.
    Inefficient query plan and query optimization in an old version; better performance in a newer version

A native XML DB (Apache Xindice) had very limited scalability. (No accurate query result over 16,000 XML instances)

We will generalize hybrid search to a distributed environment.

Page 18: Scalable Hybrid Keyword Search on Distributed Database

Hybrid Search on Distributed Databases

Data Independence: logically and physically independent; the same schema (no change), data encapsulation in each machine
Network Transparency: depends on the message-oriented middleware (MOM) or P2P framework
No replication: restricted to a computer cluster
Fragment: full partition; horizontal fragmentation
The query result for the distributed databases is the collection of query results from individual database queries.
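
Horizontal fragmentation without replication, with the global result formed by collecting the local results, can be sketched as follows. The round-robin assignment and the record layout (dicts with id and year fields) are assumptions for illustration only; the slides do not say how records were assigned to machines.

```python
def fragment(records, n_machines):
    """Horizontally partition records into n_machines disjoint fragments.

    Round-robin placement is an assumption; every record lands in exactly
    one fragment, so there is no replication.
    """
    fragments = [[] for _ in range(n_machines)]
    for i, rec in enumerate(records):
        fragments[i % n_machines].append(rec)
    return fragments


def local_query(fragment_records, year):
    """Run the query against a single fragment (one machine's database)."""
    return [r["id"] for r in fragment_records if r["year"] == year]


def distributed_query(fragments, year):
    """The global result is the collection of the per-fragment results."""
    results = []
    for frag in fragments:
        results.extend(local_query(frag, year))
    return results


# Hypothetical data: 100,000 records spread over 8 machines.
records = [{"id": i, "year": 2000 + i % 6} for i in range(100_000)]
fragments = fragment(records, 8)
```

Because the fragments are disjoint and share one schema, the collected result contains the same ids as a query over the unpartitioned data, only in a different order.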

Page 19: Scalable Hybrid Keyword Search on Distributed Database

Scalable Hybrid Search Architecture on DDBS

[Figure: Clients, a Message Broker, and several Search Services. Diagram labels: publisher for a query topic; subscriber for a query topic; publisher for a temporary topic; subscriber for a temporary topic; QueryMessage; ResultMessage.]
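
The publish/subscribe roles named on this slide can be sketched with a toy in-process broker. Everything here is hypothetical: the Broker class is not the NaradaBrokering API, delivery is synchronous, and the assignment of roles (clients publish to the query topic and subscribe to a temporary topic; search services do the reverse) is one reading of the diagram labels.

```python
import itertools
from collections import defaultdict


class Broker:
    """Toy topic broker; a stand-in for a real message broker."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        # Deliver synchronously to every subscriber of the topic.
        for callback in list(self._subscribers[topic]):
            callback(message)


class SearchService:
    """Subscribes to the query topic; publishes results to the temporary topic."""

    def __init__(self, broker, name, documents):
        self.broker, self.name, self.documents = broker, name, documents
        broker.subscribe("query", self.on_query)

    def on_query(self, query_message):
        hits = [d for d in self.documents if query_message["keyword"] in d]
        result_message = {"service": self.name, "hits": hits}
        self.broker.publish(query_message["reply_topic"], result_message)


class Client:
    """Publishes to the query topic; subscribes to its own temporary topic."""

    _ids = itertools.count(1)

    def __init__(self, broker):
        self.broker = broker

    def search(self, keyword):
        reply_topic = f"tmp-{next(Client._ids)}"
        results = []
        self.broker.subscribe(reply_topic, results.append)
        self.broker.publish("query", {"keyword": keyword, "reply_topic": reply_topic})
        return results


broker = Broker()
SearchService(broker, "s1", ["grid search", "xml db"])
SearchService(broker, "s2", ["hybrid search"])
replies = Client(broker).search("search")
```

A per-query temporary topic lets each client receive only the results for its own query, while every search service sees every query.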

Page 20: Scalable Hybrid Keyword Search on Distributed Database

Cooperating Broker Network

Distributed Databases based on NaradaBrokering Network

Page 21: Scalable Hybrid Keyword Search on Distributed Database

Query Processing – DDBS (1)

100,000 XML and 100,000 Documents in 8 machines – 12,500 each

Few keyword match (1-3) on 1 machine only

RDB – 0.04 Sec. for few keyword match

Avg. response time for an author exact match query over 8 search services

Page 22: Scalable Hybrid Keyword Search on Distributed Database

Query Processing – DDBS (2)

100,000 XML and 100,000 Documents in 8 machines – 12,500 each

RDB – half hour or 6.96 Sec. (Hash table)

Avg. response time for a year match query over 8 search services

Page 23: Scalable Hybrid Keyword Search on Distributed Database

Coupling vs. Scalability

From ICDE 2002 Tutorial

Page 24: Scalable Hybrid Keyword Search on Distributed Database

Query Propagation and Result Return on a P2P Network

Page 25: Scalable Hybrid Keyword Search on Distributed Database

Peer group architecture of the P2P Search

Page 26: Scalable Hybrid Keyword Search on Distributed Database

Conclusion

We addressed the semantic loss of keyword-only search while remaining a simpler solution than the Semantic Web

Our distributed architecture improved performance for some queries

It extended the scalability of Xindice XML queries, which were limited to small data sizes on a single machine