Index Structures for Querying the Deep Web
Jian Qiu, Feng Shao, Jayavel Shanmugasundaram (Cornell University)
Misha Zatsman (Google)
Deep Web
Surface web: keyword queries over static web pages
Deep web: databases behind sites such as www.ebay.com (eBay databases, CNN databases, Cars.com databases, Amazon databases, …)
The deep web is 400-500 times the size of the surface web!
Deep Glue
Structured queries in, query results out
Data sources: eBay database, CNN databases, Cars.com database, Amazon database, …
Deep web: 400-500 times the size of the surface web!
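Deep Glue System
Query: Find textbooks with price < $50
Query Engine: uses the index structures to find a superset of the relevant data sources
Data sources on the Internet: Half.com databases, …
Query result: Database Concepts @ half.com, …
Indexer: produces the index structures
Our focus: the index structures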
Index structure for deep web: Challenges
Deal with structured data
  Underlying databases are structured
  Surface web is typically unstructured
Deal with large volumes
  Orders of magnitude larger than the surface web
Our approach
Understand the structure/typing of the data
  Support equality and range queries
Heavily compress the index
  Achieve a factor of 10 compression
Tradeoff between compression factor and the number of false positives
  Compression factor of 10 with only ~10 false positives for 1000 data sources
Outline
Query model
Index Structures
Experimental Evaluation
Related work and conclusion
Assumptions
Data sources are classified into domains
  Online car dealers, online auctions, online travel agents, …
Data sources in the same domain use the same logical relational schema
Indexing attributes
  Price, date, make, model, isbn, …
  Indexed by the Deep Glue system
Indexing data can be obtained via
  Crawling the deep web [Raghavan 01]
  Previously agreed-upon protocols [Froogle]
Query Model
Support equality and range queries, currently on a single indexing attribute
Schema: Car(Id, Make, Model, Year, Price)
Queries:
  Find all year 2003 cars: year = 2003
  Find all cars that cost less than $1000: price < 1000
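The two query forms can be sketched as predicates on a single indexing attribute. This is only an illustration: `RangeQuery`, `equality`, and the field names are hypothetical and are not part of the Deep Glue system as presented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RangeQuery:
    """Range predicate on one indexing attribute (hypothetical type)."""
    attribute: str
    low: float = float("-inf")
    high: float = float("inf")
    include_low: bool = True
    include_high: bool = True

    def matches(self, value: float) -> bool:
        # Check each bound, honoring whether it is inclusive or exclusive.
        above = value > self.low or (self.include_low and value == self.low)
        below = value < self.high or (self.include_high and value == self.high)
        return above and below


def equality(attribute: str, value: float) -> RangeQuery:
    """An equality query is the degenerate range [value, value]."""
    return RangeQuery(attribute, low=value, high=value)


year_2003 = equality("Year", 2003)                          # year = 2003
cheap = RangeQuery("Price", high=1000, include_high=False)  # price < 1000
```

Treating equality as a one-point range keeps a single code path for both query forms, which is one possible design choice among several.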
Overview
Uncompressed Index
Compressed Index, still supporting equality and range queries
  Value Clustered Index (VCI)
  DataSource Clustered Index (DCI)
  Value DataSource Clustered Index (VDCI)
  Histogram Based Index (HBI)
Uncompressed Index (UI)
For each distinct value v of an indexing attribute, store the list of data sources that contain v
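Example (d1: ebay.com, d2: amazon.com, …)

value  d1  d2  d3  d4  d5  d6  d7  d8
1      .   .   .   .   .   X   X   X
2      .   X   X   X   .   X   .   .
3      .   .   .   X   .   .   .   .
4      X   .   .   X   X   .   .   .
5      .   X   X   .   .   X   .   .
6      X   .   .   .   .   .   X   X

UI: (value, data sources) table, indexed by a B+tree

value  data sources
1      d6, d7, d8
2      d2, d3, d4, d6
3      d4
4      d1, d4, d5
5      d2, d3, d6
6      d1, d7, d8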
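A minimal sketch of UI lookups, using the value-to-data-source lists of the example. A sorted key list searched with `bisect` stands in for the B+tree here; that substitution, and the function names `lookup_eq` and `lookup_range`, are assumptions made for illustration.

```python
from bisect import bisect_left, bisect_right

# (value -> data sources) lists from the UI example.
UI = {
    1: {"d6", "d7", "d8"},
    2: {"d2", "d3", "d4", "d6"},
    3: {"d4"},
    4: {"d1", "d4", "d5"},
    5: {"d2", "d3", "d6"},
    6: {"d1", "d7", "d8"},
}
SORTED_VALUES = sorted(UI)  # ordered keys, a stand-in for the B+tree


def lookup_eq(value):
    """Data sources that hold exactly this value."""
    return set(UI.get(value, set()))


def lookup_range(low, high):
    """Data sources that hold some value v with low <= v <= high."""
    start = bisect_left(SORTED_VALUES, low)
    stop = bisect_right(SORTED_VALUES, high)
    result = set()
    for v in SORTED_VALUES[start:stop]:
        result |= UI[v]
    return result


print(sorted(lookup_eq(3)))        # ['d4']
print(sorted(lookup_range(5, 6)))  # ['d1', 'd2', 'd3', 'd6', 'd7', 'd8']
```

The uncompressed index returns exact answers, so it has no false positives; its cost is one list per distinct value.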
Problems
A huge number of values and data sources in the deep web!
  Indexing every indexing attribute requires a large amount of space
Need to compress UI!
Use gzip?
  Have to uncompress the index: index lookup too expensive!
Need new compression techniques
Value Clustered Index (VCI)
Intuition: “closely related” values are stored in “closely related” data sources
  e.g., ISBN numbers of antique books in the online book retailers specializing in antique books
Cluster “closely related” values
Store the list of data sources only for each cluster
VCI Example
[Figure: value (rows 1-6) by data source (columns d1-d8) matrix]
Cluster 1: {1, 6}
Cluster 2: {2, 5}
Cluster 3: {3, 4}
False positive: value 1, data source d1
Tradeoff between space and accuracy, between two extremes:
  Mapping all values into one cluster
  Mapping each distinct value into a separate cluster

VCI structures (B+tree); the data sources of a cluster are the union of the data sources of its values

value  cluster id
1      c1
2      c2
3      c3
4      c3
5      c2
6      c1

cluster id  data sources
c1          d1, d6, d7, d8
c2          d2, d3, d4, d6
c3          d1, d4, d5
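The two VCI tables can be sketched as plain dictionaries. The cluster tables below are those of the VCI example; `EXACT` holds the uncompressed lists from the UI example and is used only to count false positives. The function names are hypothetical.

```python
# Value -> cluster id and cluster id -> data sources, as in the VCI example.
VALUE_TO_CLUSTER = {1: "c1", 2: "c2", 3: "c3", 4: "c3", 5: "c2", 6: "c1"}
CLUSTER_TO_SOURCES = {
    "c1": {"d1", "d6", "d7", "d8"},
    "c2": {"d2", "d3", "d4", "d6"},
    "c3": {"d1", "d4", "d5"},
}

# Exact lists from the UI example, kept here only to measure accuracy.
EXACT = {
    1: {"d6", "d7", "d8"},
    2: {"d2", "d3", "d4", "d6"},
    3: {"d4"},
    4: {"d1", "d4", "d5"},
    5: {"d2", "d3", "d6"},
    6: {"d1", "d7", "d8"},
}


def vci_lookup(value):
    """Superset of the data sources holding this value."""
    cluster = VALUE_TO_CLUSTER.get(value)
    return set(CLUSTER_TO_SOURCES.get(cluster, set()))


def vci_lookup_range(low, high):
    """Union over the clusters of all indexed values in [low, high]."""
    clusters = {c for v, c in VALUE_TO_CLUSTER.items() if low <= v <= high}
    result = set()
    for c in clusters:
        result |= CLUSTER_TO_SOURCES[c]
    return result


def false_positives(value):
    """Data sources returned by VCI that do not actually hold the value."""
    return vci_lookup(value) - EXACT.get(value, set())


print(sorted(false_positives(1)))  # ['d1']
print(sorted(false_positives(3)))  # ['d1', 'd5']
```

The result for value 1 matches the false positive called out in the example: d1 holds value 6, which shares a cluster with value 1.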
VCI Implementation
Use an existing scalable clustering algorithm
  Scales to large data sets: Birch framework [Zhang96]
Minimize the number of false positives
  Specify the parameters for Birch
    Centroid: the mid-point of a cluster
    Radius: a measure of quality for a cluster
    Distance between clusters
[Figure: two clusters (cluster1, cluster2) with centroid, radius, and distance labeled]
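VCI formulae
For a cluster having the set of values V
ds(v): the set of data sources for value v
$\mathrm{centroid}(V) = \bigcup_{v \in V} ds(v)$
  Data sources associated with the cluster
$\mathrm{radius}(V) = \frac{\sum_{v \in V} |\mathrm{centroid}(V) \setminus ds(v)|}{|V|}$
  The numerator is the sum of the number of false positives
$\mathrm{distance}(V_1, V_2) = |V_1| \cdot |\mathrm{centroid}(V_2) \setminus \mathrm{centroid}(V_1)| + |V_2| \cdot |\mathrm{centroid}(V_1) \setminus \mathrm{centroid}(V_2)|$
  Additional number of false positives when merging two clusters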
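A small sketch of the three Birch parameters over sets, reading the centroid as the union of ds(v), the radius as the number of false positives averaged over the values of the cluster, and the distance as the additional false positives caused by a merge. The division by |V| in the radius is an assumption about the formula; the data is the value-to-data-source mapping of the UI example.

```python
DS = {
    1: {"d6", "d7", "d8"},
    2: {"d2", "d3", "d4", "d6"},
    3: {"d4"},
    4: {"d1", "d4", "d5"},
    5: {"d2", "d3", "d6"},
    6: {"d1", "d7", "d8"},
}


def centroid(values, ds=DS):
    """Union of the data sources of all values in the cluster."""
    result = set()
    for v in values:
        result |= ds[v]
    return result


def radius(values, ds=DS):
    """False positives per value, averaged over the cluster (assumed form)."""
    c = centroid(values, ds)
    return sum(len(c - ds[v]) for v in values) / len(values)


def distance(v1, v2, ds=DS):
    """Additional false positives introduced by merging clusters v1 and v2."""
    c1, c2 = centroid(v1, ds), centroid(v2, ds)
    return len(v1) * len(c2 - c1) + len(v2) * len(c1 - c2)


print(sorted(centroid({1, 6})))  # ['d1', 'd6', 'd7', 'd8']
print(radius({1, 6}))            # 1.0
print(distance({1}, {6}))        # 2
```

Merging {1} and {6} adds d1 as a false positive for value 1 and d6 as a false positive for value 6, which is the distance of 2 computed above.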
DataSource Clustered Index (DCI)
Intuition: “closely related” data sources may have “closely related” sets of values
  e.g., Amazon and B&N have similar sets of ISBN numbers
In the data graph, VCI clusters rows and DCI clusters columns
[Figure: value (rows 1-6) by data source (columns d1-d8) matrix]
Cluster 1: {d2, d3, d6}
Cluster 2: {d4, d5}
Cluster 3: {d1, d7, d8}
Table structures are similar to VCI.
See paper for other details
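One way to picture DCI is sketched below, under stated assumptions: each value stores the ids of the data-source clusters that contain at least one source holding it, and a lookup returns every source in those clusters. The slides defer the actual table layout to the paper, so this layout, the cluster ids `s1` to `s3`, and the function names are hypothetical. The clusters are the three listed above; the value lists are taken from the UI example and may differ from the matrix drawn on this slide.

```python
# Data-source clusters as listed in the DCI example (ids are hypothetical).
SOURCE_CLUSTERS = {
    "s1": {"d2", "d3", "d6"},
    "s2": {"d4", "d5"},
    "s3": {"d1", "d7", "d8"},
}

# Exact value lists from the UI example (assumed input to the indexer).
EXACT = {
    1: {"d6", "d7", "d8"},
    2: {"d2", "d3", "d4", "d6"},
    3: {"d4"},
    4: {"d1", "d4", "d5"},
    5: {"d2", "d3", "d6"},
    6: {"d1", "d7", "d8"},
}


def build_dci(exact, clusters):
    """Map each value to the clusters that contain one of its data sources."""
    return {
        value: {cid for cid, members in clusters.items() if members & sources}
        for value, sources in exact.items()
    }


DCI = build_dci(EXACT, SOURCE_CLUSTERS)


def dci_lookup(value):
    """Superset of the data sources holding this value."""
    result = set()
    for cid in DCI.get(value, set()):
        result |= SOURCE_CLUSTERS[cid]
    return result


print(sorted(dci_lookup(5)))             # ['d2', 'd3', 'd6']
print(sorted(dci_lookup(3) - EXACT[3]))  # ['d5']
```

Value 5 is answered exactly because its sources coincide with one cluster, while value 3 picks up d5 as a false positive from the cluster it shares with d4.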
Value-DataSource Clustered Index (VDCI)
VCI, DCI: clusters in 1 dimension
VDCI: clusters in 2 dimensions, generalizes VCI/DCI
Cluster: a set of values and a set of data sources
[Figure: value (rows 1-6) by data source (columns d1-d8) matrix]
Cluster 1: { {2, 3}, {d2, d3, d4} }
Cluster 2: { {4, 5}, {d4, d5, d6} }
Cluster 3: { {1, 2}, {d6, d7, d8} }
Data source d4 is in two clusters
Value 2 is in two clusters
Table structures are similar to VCI.
See paper for other details
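The two-dimensional clusters can be represented as (value set, data source set) pairs. The sketch below only covers the three clusters listed in the example (value 6, for instance, is in none of them), and the lookup rule, a union over every cluster containing the value, is an assumption; the slides defer the details to the paper.

```python
from collections import Counter

# (values, data sources) pairs as listed in the VDCI example.
CLUSTERS = [
    ({2, 3}, {"d2", "d3", "d4"}),
    ({4, 5}, {"d4", "d5", "d6"}),
    ({1, 2}, {"d6", "d7", "d8"}),
]

# How many clusters each value and each data source belongs to.
value_membership = Counter(v for values, _ in CLUSTERS for v in values)
source_membership = Counter(d for _, sources in CLUSTERS for d in sources)


def vdci_lookup(value):
    """Union of the source sets of every listed cluster containing the value."""
    result = set()
    for values, sources in CLUSTERS:
        if value in values:
            result |= sources
    return result


print(value_membership[2])     # 2
print(source_membership["d4"])  # 2
print(sorted(vdci_lookup(2)))  # ['d2', 'd3', 'd4', 'd6', 'd7', 'd8']
```

Unlike VCI and DCI, a value or a data source may appear in several clusters, so a lookup has to combine all of them.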
Histogram Based Index (HBI)
VCI/VDCI don’t consider the ordering among values
  Range queries need this ordering
HBI groups adjacent values in the same cluster
Also need to ensure accuracy
  Use a threshold to determine the boundary of a cluster
  Threshold: average number of false positives in a cluster
HBI Example
[Figure: value (rows 1-6) by data source (columns d1-d8) matrix]
Threshold: 2
Cluster adjacent values
Cluster 1: {1}
Cluster 2: {2,3,4}
Cluster 3: {5,6}
Experimental setup
Synthetic data
  1000 data sources, 100,000 values, 4,000,000 (value, data source) pairs
  Other parameters are in the paper
Metrics
  Index creation time
  Compression factor
  False positives
Setup
  2.8GHz Pentium IV, 1GB memory, 80GB disk
  C++
Index creation time
Index structure  Time (min)
UI               0.25
VCI              15
DCI              3
VDCI             180
HBI              2.5
[Figure: Equality queries (1000 data sources). False positives (0-50) vs. compression factor (0-20) for VCI, DCI, VDCI, HBI]
[Figure: Range queries (1000 data sources). False positives (0-50) vs. compression factor (0-20) for VCI, DCI, VDCI, HBI]
Related work
Distributed database & information integration
  Niagara system [Naughton01]
  GlOSS [Gravano99]
  …
Database/Inverted list compression
  Query Optimization in Compressed Databases [Chen 01]
  Compressing the Relations and Index [Goldstein 98]
  Improved Query Performance with Variant Indices [O’Neill 97]
  Implementation and Performance of Compressed Databases [Westmann 00]
  Size Reduction of Inverted Files [Weiss 90]
  …
Conclusion
Space-efficient index structures for querying the deep web
  Support equality and range queries
  A factor of 10 compression with a little loss in precision
Future work
  Combine cluster-based and histogram-based indexes
  Multiple-attribute queries
  Joins
  Incremental index maintenance
Questions?
Experimental setup: other parameters
Number of groups
  The data sources in the same group use the same distribution to generate the values
  Default: 20
Group mode
  How many groups a data source belongs to
  Default: 1
Value correlation
  How the order of the value space maps to the value ordering over which the Gaussian distribution is used
  Default: 0.2