Index Structures for Querying the Deep Web

Jian Qiu, Feng Shao, Jayavel Shanmugasundaram (Cornell University)

Misha Zatsman (Google)

Surface web: keyword queries over static web pages.

Deep web: content stored in databases behind web sites such as Ebay (www.ebay.com), CNN, Cars.com, Amazon, …

The deep web is 400-500 times the size of the surface web!

Deep Glue: structured queries go in and query results come out, over the deep web databases (Ebay, CNN, Cars.com, …, Amazon).

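Deep Glue System

[Figure: system architecture. A query such as "Find textbooks with price<$50" is sent to the Query Engine, which uses the index structures built by the Indexer to return a superset of the relevant data sources (for example "Database Concepts @ half.com…"). The data sources (…, Half.com databases) are reached over the Internet. The index structures are our focus.]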

Index structure for deep web: Challenges

- Deal with structured data
  - Underlying databases are structured
  - Surface web is typically unstructured
- Deal with large volumes
  - Orders of magnitude larger than the surface web

Our approach

- Understand the structure/typing of the data
- Support equality and range queries
- Heavily compress the index
  - Achieve a factor of 10 compression
- Tradeoff between compression factor and the number of false positives
  - Compression factor of 10 with only ~10 false positives for 1000 data sources

Outline

- Query model
- Index structures
- Experimental evaluation
- Related work and conclusion

Assumptions

- Data sources are classified into domains
  - Online car dealers, online auctions, online travel agents, …
- Data sources in the same domain use the same logical relational schema
- Indexing attributes
  - Price, date, make, model, isbn, …
  - Indexed by the Deep Glue system
- Indexing data can be obtained via
  - Crawling the deep web [Raghavan 01]
  - Previously agreed-upon protocols [Froogle]

Query Model

- Support equality and range queries, currently on a single indexing attribute
- Schema: Car(Id, Make, Model, Year, Price)
- Queries:
  - Find all year 2003 cars: year = 2003
  - Find all cars that cost less than $1000: price < 1000


Overview

- Uncompressed Index (UI)
- Compressed indexes that still support equality and range queries
  - Value Clustered Index (VCI)
  - DataSource Clustered Index (DCI)
  - Value DataSource Clustered Index (VDCI)
  - Histogram Based Index (HBI)

Uncompressed Index (UI)

For each distinct value v of an indexing attribute, the UI stores the list of data sources that contain v.

Data sources: d1: ebay.com, d2: amazon.com, …

value   d1  d2  d3  d4  d5  d6  d7  d8
1       .   .   .   .   .   X   X   X
2       .   X   X   X   .   X   .   .
3       .   .   .   X   .   .   .   .
4       X   .   .   X   X   .   .   .
5       .   X   X   .   .   X   .   .
6       X   .   .   .   .   .   X   X

UI (B+tree):

value   data sources
1       d6, d7, d8
2       d2, d3, d4, d6
3       d4
4       d1, d4, d5
5       d2, d3, d6
6       d1, d7, d8
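The lookup behaviour of the UI can be illustrated with a minimal Python sketch. It uses the value-to-data-source lists from the example; the sorted key list is an assumption that stands in for the ordered leaf level of the B+tree, and the function names `ui_equality` and `ui_range` are hypothetical.

```python
from bisect import bisect_left, bisect_right

# Posting lists from the example: value -> data sources that contain it.
UI = {
    1: {"d6", "d7", "d8"},
    2: {"d2", "d3", "d4", "d6"},
    3: {"d4"},
    4: {"d1", "d4", "d5"},
    5: {"d2", "d3", "d6"},
    6: {"d1", "d7", "d8"},
}

# Assumption: a sorted list of keys stands in for the B+tree's ordered leaves.
KEYS = sorted(UI)


def ui_equality(value):
    """Data sources for an equality query such as `year = 2003`."""
    return set(UI.get(value, ()))


def ui_range(low, high):
    """Data sources for a range query low <= attribute <= high (inclusive)."""
    start = bisect_left(KEYS, low)    # first key >= low
    stop = bisect_right(KEYS, high)   # one past the last key <= high
    sources = set()
    for key in KEYS[start:stop]:
        sources |= UI[key]
    return sources
```

For instance, `ui_equality(3)` yields only d4, and `ui_range(3, 4)` yields d1, d4, and d5. The UI never reports a false positive, which is why it serves as the baseline that the compressed indexes trade against.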

Problems

- A huge number of values and data sources in the deep web!
  - Indexing every indexing attribute requires too much space
- Need to compress the UI!
  - Use gzip? The index would have to be uncompressed first, which makes index lookup too expensive!
- Need new compression techniques


Value Clustered Index (VCI)

- Intuition: “closely related” values are stored in “closely related” data sources
  - Example: ISBN numbers of antique books are found in the online book retailers specializing in antique books
- Cluster “closely related” values
- Store the list of data sources only once per cluster

VCI Example

[Figure: matrix of values 1-6 (rows) against data sources d1-d8 (columns), with an X for each (value, data source) pair.]

Clusters of values:

- Cluster 1: {1, 6}
- Cluster 2: {2, 5}
- Cluster 3: {3, 4}

VCI structures (B+tree):

value   cluster id
1       c1
2       c2
3       c3
4       c3
5       c2
6       c1

cluster id   data sources (union over the values in the cluster)
c1           d1, d6, d7, d8
c2           d2, d3, d4, d6
c3           d1, d4, d5

False positives: for example, value 1 and data source d1.

Tradeoff between space and accuracy, between two extremes:

- Mapping all values into one cluster
- Mapping each distinct value into a separate cluster
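A small Python sketch may help show how a VCI lookup produces false positives. The two mappings are the ones from the example; `TRUE_SOURCES` repeats the uncompressed lists from the UI example only so that false positives can be counted. The function names are hypothetical, and plain dictionaries stand in for the B+tree.

```python
# value -> cluster id, as in the example.
VALUE_TO_CLUSTER = {1: "c1", 2: "c2", 3: "c3", 4: "c3", 5: "c2", 6: "c1"}

# cluster id -> union of the data sources of the cluster's values.
CLUSTER_TO_SOURCES = {
    "c1": {"d1", "d6", "d7", "d8"},
    "c2": {"d2", "d3", "d4", "d6"},
    "c3": {"d1", "d4", "d5"},
}

# Uncompressed lists from the UI example, kept here only for comparison.
TRUE_SOURCES = {
    1: {"d6", "d7", "d8"},
    2: {"d2", "d3", "d4", "d6"},
    3: {"d4"},
    4: {"d1", "d4", "d5"},
    5: {"d2", "d3", "d6"},
    6: {"d1", "d7", "d8"},
}


def vci_equality(value):
    """Return the data sources of the cluster that the value belongs to."""
    cluster_id = VALUE_TO_CLUSTER.get(value)
    if cluster_id is None:
        return set()
    return set(CLUSTER_TO_SOURCES[cluster_id])


def false_positives(value):
    """Data sources returned by the VCI that do not actually hold the value."""
    return vci_equality(value) - TRUE_SOURCES.get(value, set())
```

Here `false_positives(1)` is `{"d1"}`, matching the false positive called out in the example, while every true data source is still returned. The result is always a superset of the relevant data sources.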

VCI Implementation

- Use an existing scalable clustering algorithm
  - Scales to large data sets: BIRCH framework [Zhang96]
- Minimize the number of false positives
  - Specify the parameters for BIRCH:
    - Centroid: the mid-point of a cluster
    - Radius: a measure of quality for a cluster
    - Distance between clusters

[Figure: two clusters, cluster1 and cluster2, annotated with centroid, radius, and the distance between them.]

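VCI formulae

For a cluster having the set of values V, with ds(v) the set of data sources for value v:

\[ \mathrm{centroid}(V) = \bigcup_{v \in V} ds(v) \]

The centroid is the set of data sources associated with the cluster.

\[ \mathrm{radius}(V) = \frac{\sum_{v \in V} \left| \mathrm{centroid}(V) \setminus ds(v) \right|}{|V|} \]

The numerator is the sum of the number of false positives over the values in the cluster.

\[ \mathrm{distance}(V_1, V_2) = |V_1| \cdot \left| \mathrm{centroid}(V_2) \setminus \mathrm{centroid}(V_1) \right| + |V_2| \cdot \left| \mathrm{centroid}(V_1) \setminus \mathrm{centroid}(V_2) \right| \]

The distance is the additional number of false positives when merging two clusters.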
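The three quantities can be written down as a short Python sketch. It reads the centroid as the union of the data-source sets, the radius as the number of false positives averaged over the values in the cluster (the division by the cluster size is an assumption), and the distance as the extra false positives caused by a merge. The data is taken from the UI example; the function names mirror the slide but the code is only illustrative and is not the BIRCH implementation.

```python
# ds(v): data sources for each value, from the UI example.
DS = {
    1: {"d6", "d7", "d8"},
    2: {"d2", "d3", "d4", "d6"},
    3: {"d4"},
    4: {"d1", "d4", "d5"},
    5: {"d2", "d3", "d6"},
    6: {"d1", "d7", "d8"},
}


def centroid(values):
    """Union of the data sources of all values in the cluster."""
    sources = set()
    for v in values:
        sources |= DS[v]
    return sources


def radius(values):
    """Average number of false positives per value (assumed normalization)."""
    c = centroid(values)
    return sum(len(c - DS[v]) for v in values) / len(values)


def distance(values1, values2):
    """Additional false positives introduced by merging the two clusters."""
    c1, c2 = centroid(values1), centroid(values2)
    return len(values1) * len(c2 - c1) + len(values2) * len(c1 - c2)
```

As a consistency check, merging the singleton clusters {1} and {6} gives `distance({1}, {6}) == 2`, and the merged cluster {1, 6} has two false positives in total (d1 for value 1 and d6 for value 6), so `radius({1, 6}) == 1.0`.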


DataSource Clustered Index (DCI)

- Intuition: “closely related” data sources may have “closely related” sets of values
  - Example: Amazon and b&n have similar sets of ISBN numbers
- In the data graph, VCI clusters rows and DCI clusters columns

[Figure: matrix of values 1-6 (rows) against data sources d1-d8 (columns), with an X for each (value, data source) pair.]

Clusters of data sources:

- Cluster 1: {d2, d3, d6}
- Cluster 2: {d4, d5}
- Cluster 3: {d1, d7, d8}

Table structures are similar to VCI.

See paper for other details
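One way to picture a DCI lookup is the following Python sketch. It uses the three data-source clusters listed above, but everything else is an assumption: the per-value lists are borrowed from the UI example rather than from this slide's matrix, the cluster ids `s1` to `s3` and the function names are hypothetical, and the table layout (value to cluster ids, cluster id to data sources) is inferred from the remark that the structures are similar to VCI. The paper should be consulted for the actual design.

```python
# Data-source clusters from the slide (cluster ids are hypothetical labels).
SOURCE_CLUSTERS = {
    "s1": {"d2", "d3", "d6"},
    "s2": {"d4", "d5"},
    "s3": {"d1", "d7", "d8"},
}
SOURCE_TO_CLUSTER = {
    source: cid for cid, members in SOURCE_CLUSTERS.items() for source in members
}

# Assumption: per-value data sources borrowed from the UI example.
TRUE_SOURCES = {
    1: {"d6", "d7", "d8"},
    2: {"d2", "d3", "d4", "d6"},
    3: {"d4"},
    4: {"d1", "d4", "d5"},
    5: {"d2", "d3", "d6"},
    6: {"d1", "d7", "d8"},
}

# DCI: each value stores cluster ids instead of individual data sources.
DCI = {
    value: {SOURCE_TO_CLUSTER[s] for s in sources}
    for value, sources in TRUE_SOURCES.items()
}


def dci_equality(value):
    """Expand the stored cluster ids back into data sources."""
    result = set()
    for cid in DCI.get(value, ()):
        result |= SOURCE_CLUSTERS[cid]
    return result
```

With this data, value 5 maps to the single cluster `s1` and is answered exactly, whereas value 3 maps to `s2` and picks up d5 as a false positive. Clustering columns shortens each value's list at the cost of returning whole clusters.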


Value-DataSource Clustered Index (VDCI)

- VCI, DCI: clusters in 1 dimension
- VDCI: clusters in 2 dimensions, generalizes VCI/DCI
- Cluster: a set of values and a set of data sources

[Figure: matrix of values 1-6 (rows) against data sources d1-d8 (columns), with an X for each (value, data source) pair.]

- Cluster 1: { {2,3}, {d2,d3,d4} }
- Cluster 2: { {4,5}, {d4,d5,d6} }
- Cluster 3: { {1,2}, {d6,d7,d8} }

Data source d4 is in two clusters, and value 2 is in two clusters.

Table structures are similar to VCI.

See paper for other details


Histogram Based Index (HBI)

- VCI/VDCI do not consider the ordering among values
  - Range queries need this ordering
- HBI groups adjacent values into the same cluster
- Accuracy also needs to be ensured
  - Use a threshold to determine the boundary of a cluster
  - Threshold: average number of false positives in a cluster

HBI Example

[Figure: matrix of values 1-6 (rows) against data sources d1-d8 (columns), with an X for each (value, data source) pair.]

Threshold: 2

Cluster adjacent values:

- Cluster 1: {1}
- Cluster 2: {2,3,4}
- Cluster 3: {5,6}
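Because each HBI cluster covers a contiguous run of values, a range query only has to touch the clusters that overlap the range. The Python sketch below illustrates this with the three clusters from the example. The data-source set attached to each cluster is an assumption (the union of the per-value lists from the UI example, since this slide's matrix is not reproduced here), and the function name `hbi_range` is hypothetical.

```python
from bisect import bisect_left, bisect_right

# (low, high, data sources) per cluster; value ranges come from the example.
# Assumption: the data-source sets are unions of the UI example's lists.
BUCKETS = [
    (1, 1, {"d6", "d7", "d8"}),
    (2, 4, {"d1", "d2", "d3", "d4", "d5", "d6"}),
    (5, 6, {"d1", "d2", "d3", "d6", "d7", "d8"}),
]
LOWS = [low for low, _, _ in BUCKETS]
HIGHS = [high for _, high, _ in BUCKETS]


def hbi_range(low, high):
    """Data sources of every bucket overlapping [low, high] (inclusive)."""
    start = bisect_left(HIGHS, low)   # first bucket whose upper bound >= low
    stop = bisect_right(LOWS, high)   # one past the last bucket whose lower bound <= high
    sources = set()
    for _, _, bucket_sources in BUCKETS[start:stop]:
        sources |= bucket_sources
    return sources
```

A query for values 3 to 4 touches only the second bucket, while a query for values 4 to 5 spans the second and third buckets. An equality query is the special case `hbi_range(v, v)`.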


Experimental setup

- Synthetic data
  - 1000 data sources, 100,000 values, 4,000,000 (value, data source) pairs
  - Other parameters are in the paper
- Metrics
  - Index creation time
  - Compression factor
  - False positives
- Setup
  - 2.8GHz Pentium IV, 1GB memory, 80GB disk
  - C++

Index creation time

Index structure   Time (min)
UI                0.25
VCI               15
DCI               3
VDCI              180
HBI               2.5

Equality queries (1000 data sources)

[Figure: number of false positives (y-axis, 0 to 50) against compression factor (x-axis, 0 to 20) for VCI, DCI, VDCI, and HBI.]

Range Queries (1000 data sources)

[Figure: number of false positives (y-axis, 0 to 50) against compression factor (x-axis, 0 to 20) for VCI, DCI, VDCI, and HBI.]


Related work

- Distributed database & information integration
  - Niagara system [Naughton01]
  - GlOSS [Gravano99]
  - …
- Database/inverted list compression
  - Query Optimization in Compressed Databases [Chen 01]
  - Compressing the Relations and Index [Goldstein 98]
  - Improved Query Performance with Variant Indices [O’Neill 97]
  - Implementation and Performance of Compressed Databases [Westmann 00]
  - Size Reduction of Inverted Files [Weiss 90]
  - …

Conclusion

- Space-efficient index structures for querying the deep web
  - Support equality and range queries
  - A factor of 10 compression with little loss in precision
- Future work
  - Combine cluster-based and histogram-based indexes
  - Multiple-attribute queries
  - Joins
  - Incremental index maintenance

Questions?

Experimental setup: other parameters

- Number of groups
  - The data sources in the same group use the same distribution to generate their values
  - Default: 20
- Group mode
  - How many groups a data source belongs to
  - Default: 1
- Value correlation
  - How the order in the value space maps to the value ordering over which the Gaussian distribution is used
  - Default: 0.2
