keyword search in databases

Click here to load reader

Post on 12-Sep-2021

0 views

Category:

Documents

0 download

Embed Size (px)

TRANSCRIPT

exp7_3_2.epsEditor M. Tamer Özsu
Keyword Search in Databases Jeffrey Xu Yu, Lu Qin, and Lijun Chang 2010
Copyright © 2010 by Morgan & Claypool
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means—electronic, mechanical, photocopy, recording, or any other except for brief quotations in printed reviews, without the prior permission of the publisher.
Keyword Search in Databases
www.morganclaypool.com
DOI 10.2200/S00231ED1V01Y200912DTM001
A Publication in the Morgan & Claypool Publishers series SYNTHESIS LECTURES ON DATA MANAGEMENT
Lecture #1 Series Editor: M. Tamer Özsu Series ISSN Synthesis Lectures on Data Management ISSN pending.
Keyword Search in Databases
Jeffrey Xu Yu, Lu Qin, and Lijun Chang Chinese University of Hong Kong
SYNTHESIS LECTURES ON DATA MANAGEMENT #1
CM& cLaypoolMorgan publishers&
ABSTRACT It has become highly desirable to provide users with flexible ways to query/search information over databases as simple as keyword search like Google search.
This book surveys the recent developments on keyword search over databases, and focuses on finding structural information among objects in a database using a set of keywords. Such structural information to be returned can be either trees or subgraphs representing how the objects, that contain the required keywords, are interconnected in a relational database or in an XML database. The structural keyword search is completely different from finding documents that contain all the user-given keywords. The former focuses on the interconnected object structures, whereas the latter focuses on the object content.
The book is organized as follows. In Chapter 1, we highlight the main research issues on the structural keyword search in different contexts. In Chapter 2, we focus on supporting structural keyword search in a relational database management system using the SQL query language. We concentrate on how to generate a set of SQL queries that can find all the structural information among records in a relational database completely, and how to evaluate the generated set of SQL queries efficiently. In Chapter 3,we discuss graph algorithms for structural keyword search by treating an entire relational database as a large data graph. In Chapter 4, we discuss structural keyword search in a large tree-structured XML database. In Chapter 5,we highlight several interesting research issues regarding keyword search on databases.
The book can be used as either an extended survey for people who are interested in the structural keyword search or a reference book for a postgraduate course on the related topics.
KEYWORDS keyword search, interconnected object structures, relational databases, XML databases, data stream, rank
To my wife, Hannah my children, Michael and Stephen
Jeffrey Xu Yu
Lu Qin
ix
Contents
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.3.1 Getting All MTJNT s in a Relational Database 22
2.3.2 Getting Top-k MTJNT s in a Relational Database 29
2.4 Other Keyword Search Semantics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3 Graph-Based Keyword Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.1 Graph Model and Problem Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.2 Polynomial Delay and Dijkstra’s Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.3 Steiner Tree-Based Keyword Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.3.1 Backward Search 53
3.3.2 Dynamic Programming 55
3.4 Distinct Root-Based Keyword Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
3.4.1 Bidirectional Search 69
3.4.2 Bi-Level Indexing 71
3.5 Subgraph-Based Keyword Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
3.5.1 r-Radius Steiner Graph 76
3.5.2 Multi-Center Induced Graph 78
x CONTENTS
4.1 XML and Problem Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
4.1.1 LCA, SLCA, ELCA, and CLCA 84
4.1.2 Problem Definition and Notations 87
4.2 SLCA-Based Semantics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
4.2.2 Efficient Algorithms for SLCAs 90
4.3 Identify Meaningful Return Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .98
4.3.1 XSeek 99
4.4.2 Identifying Meaningful ELCAs 111
4.5 Other Approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
5.1 Keyword Search Across Databases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
5.1.1 Selection of Databases 115
5.1.2 Answering Keyword Queries Across Databases 119
5.2 Keyword Search on Spatial Databases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
5.2.1 Points as Result 120
5.2.2 Area as Result 121
5.3 Variations of Keyword Search on Databases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
5.3.1 Objects as Results 123
5.3.2 Subspaces as Results 125
5.3.3 Terms as Results 126
5.3.4 sql Queries as Results 128
5.3.5 Small Database as Result 129
5.3.6 Other Related Issues 130
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
Authors’ Biographies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
Preface It has become highly desirable to provide flexible ways for users to query/search information by integrating database (DB) and information retrieval (IR) techniques in the same platform. On one hand, the sophisticated DB facilities provided by a database management system assist users to query well-structured information using a query language based on database schemas.Such systems include conventional rdbmss (such as DB2, ORACLE, SQL-Server), which use sql to query relational databases (RDBs) and XML data management systems, which use XQuery to query XML databases. On the other hand, IR techniques allow users to search unstructured information using keywords based on scoring and ranking, and they do not need users to understand any database schemas. The main research issues on DB/IR integration are discussed by Chaudhuri et al. [2005] and debated in a SIGMOD panel discussion [Amer-Yahia et al., 2005]. Several tutorials are also given on keyword search over RDBs and XML databases, including those by Amer-Yahia and Shanmugasundaram [2005]; Chaudhuri and Das [2009]; Chen et al. [2009].
The main purpose of this book is to survey the recent developments on keyword search over databases that focuses on finding structural information among objects in a database using a keyword query that is a set of keywords. Such structural information to be returned can be either trees or sub- graphs representing how the objects, which contain the required keywords, are interconnected in an RDB or in an XML database. In this book, we call this structural keyword search or, simply, keyword search. The structural keyword search is completely different from finding documents that contain all the user-given keywords. The former focuses on the interconnected object structures, whereas the latter focuses on the object content. In a DB/IR context, for this book, we use keyword search and keyword query interchangeably. We introduce forms of answers, scoring/ranking functions, and approaches to process keyword queries.
The book is organized as follows. In Chapter 1, we highlight the main research issues on the structural keyword search in
different contexts. In Chapter 2, we focus on supporting keyword search in an rdms using sql. Since this implies
making use of the database schema information to issue sql queries in order to find structural information for a keyword query, it is generally called a schema-based approach. We concentrate on the two main steps in the schema-based approach, namely, how to generate a set of sql queries that can find all the structural information among tuples in an RDB completely and how to evaluate the generated set of sql queries efficiently. We will address how to find all or top-k answers in a static RDB or a dynamic data stream environment.
In Chapter 3, we also focus on supporting keyword search in an rdbms. Unlike the approaches discussed in Chapter 2 using sql, we discuss the approaches that are based on graph algorithms by
xii PREFACE
materializing an entire database as a large data graph.This type of approach is called schema-free, in the sense that it does not request any database schema assistance. We introduce several algorithms, namely polynomial delay based algorithms, dynamic programming based algorithms, and Dijkstra shortest path based algorithms. We discuss how to find exact top-k and approximate top-k answers in a large data graph for a keyword query. We will discuss the indexing mechanisms and the ways to handle a large graph on disk.
In Chapter 4,we discuss keyword search in an XML database where an XML database is a large data tree. The two main issues are how to find all subtrees that contain all the user-given keywords and how to identify the meaning of such returned subtrees. We will discuss several algorithms to find subtrees based on lowest common ancestor (LCA) semantics, smallest LCA semantics, exclusive LCA semantics, etc.
In Chapter 5, we highlight several interesting research issues regarding keyword search on databases. The topics include how to select a database among many possible databases to answer a keyword query, how to support keyword query in a spatial database, how to rank objects according to their relevance to a keyword query using PageRank-like approaches, how to process keyword queries in an OLAP (On-Line Analytical Processing) context, how to find frequent additional keywords that are most related to a keyword query, how to interpret a keyword query by showing top-k sql queries, and how to project a small database that only contains objects related to a keyword query.
The book surveys the recent developments on the structural keyword search. The book can be used as either an extended survey for people who are interested in the structural keyword search or a reference book for a postgraduate course on the related topics.
We acknowledge the support of our research on keyword search by the grant of the Research Grants Council of the Hong Kong SAR, China, No. 419109.
We are greatly indebted to M.Tamer Özsu who encouraged us to write this book and provided many valuable comments to improve the quality of the book.
Jeffrey Xu Yu, Lu Qin, and Lijun Chang The Department of Systems Engineering and Engineering Management The Faculty of Engineering The Chinese University of Hong Kong December, 2009
1
C H A P T E R 1
Introduction Conceptually, a database can be viewed as a data graph GD(V, E), where V represents a set of objects, and E represents a set of connections between objects. In this book, we concentrate on two kinds of databases, a relational database (RDB) and an XML database. In an RDB, an object is a tuple that consists of many attribute values where some attribute values are strings or full-text; there is a connection between two objects if there exists at least one reference from one to the other. In an XML database, an object is an element that may have attributes/values. Like RDBs, some values are strings. There is a connection (parent/child relationship) between two objects if one links to the other. An RDB is viewed as a large graph, whereas an XML database is viewed as a large tree.
The main purpose of this book is to survey the recent developments on finding structural in- formation among objects in a database using a keyword query, Q, which is a set of keywords of size l, denoted as Q = {k1, k2, · · · , kl}. We call it an l-keyword query.The structural information to be re- turned for an l-keyword query can be a set of connected structures, R = {R1(V , E), R2(V , E), · · · } where Ri(V, E) is a connected structure that represents how the objects that contain the required keywords, are interconnected in a database GD . S can be either all trees or all subgraphs. When a function score(·) is given to score a structure, we can find the top-k structures instead of all structures in the database GD . Such a score(·) function can be based on either the text information maintained in objects (node weights) or the connections among objects (edge weights), or both.
In Chapter 2,we focus on supporting keyword search in an rdbms using sql.Since this implies making use of the database schema information to issue sql queries in order to find structures for an l-keyword query, it is called the schema-based approach. The two main steps in the schema-based approach are how to generate a set of sql queries that can find all the structures among tuples in an RDB completely and how to evaluate the generated set of sql queries efficiently. Due to the nature of set operations used in sql and the underneath relational algebra, a data graph GD is considered as an undirected graph by ignoring the direction of references between tuples, and, therefore, a returned structure is of undirected structure (either tree or subgraph).The existing algorithms use a parameter to control the maximum size of a structure allowed. Such a size control parameter limits the number of sql queries to be executed. Otherwise, the number of sql queries to be executed for finding all or even top-k structures is too large. The score(·) functions used to rank the structures are all based on the text information on objects. We will address how to find all or top-k structures in a static RDB or a dynamic data stream environment.
In Chapter 3, we focus on supporting keyword search in an rdbms from a different viewpoint, by treating an RDB as a directed graph GD . Unlike an undirected graph, the fact that an object v
can reach to another object u in a directed graph does not necessarily mean that the object v is
2 1. INTRODUCTION
reachable from u. In this context, a returned structure (either steiner tree, distinct rooted tree, r- radius steiner graph, or multi-center subgraph) is directed. Such direction handling provides users with more information on how the objects are interconnected. On the other hand, it requests higher computational cost to find such structures. Many graph-based algorithms are designed to find top- k structures, where the score(·) functions used to rank the structures are mainly based on the connections among objects. This type of approach is called schema-free in the sense that it does not request any database schema assistance. In this chapter, we introduce several algorithms, namely polynomial delay based algorithms, dynamic programming based algorithms, and Dijkstra shortest path based algorithms. We discuss how to find exact top-k and approximate top-k structures in GD
for an l-keyword query. The size control parameter is not always needed in this type of approach. For example, the algorithms that find the optimal top-k steiner trees attempt to find the optimal top-k steiner trees among all possible combinations in GD without a size control parameter. We also discuss the indexing mechanisms and the ways to handle a large graph on disk.
In Chapter 4, we discuss keyword search in an XML database where an XML database is considered as a large directed tree. Therefore, in this context, the data graph GD is a directed tree. Such a directed tree may be seen as a special case of the directed graph, so that the algorithms discussed in Chapter 3 can be used to support l-keyword queries in an XML database. However, the main research issue is different. The existing approaches process l-keyword queries in the context of XML databases by finding structures that are based on the lowest common ancestor (LCA) of the objects that contain the required keywords. In other words, a returned structure is a subtree rooted at the LCA in GD that contains the required keywords in the subtree, but it is not any subtree in GD that contains the required keywords in the subtree. The main research issue is to efficiently find meaningful structures to be returned. The meaningfulness are not defined based on score(·) functions. Algorithms are proposed to find smallest LCA, exclusive LCA, and compact LCA, which we will discuss in Chapter 4.
In Chapter 5, we highlight several interesting research issues regarding keyword search on databases. The topics include how to select a database among many possible databases to answer…

View more