XCluster Synopses for Structured XML Content Alkis Polyzotis (UC Santa Cruz) Minos Garofalakis (Intel Research, Berkeley)


Page 1

XCluster Synopses for Structured XML Content

Alkis Polyzotis (UC Santa Cruz)

Minos Garofalakis (Intel Research, Berkeley)

Page 2

XML Summarization

Synopses are essential for XML data management
- Statistics for XML query optimization
- Approximate query answering

Active research topic in the field of XML databases
- Markov Tables, XSketch, XPathLearner, CSTs, TreeSketch, ...

[Figure: a query Q evaluated on the XML data yields count(Q), the selectivity of Q; evaluated on the synopsis, it yields the estimated selectivity of Q.]

Page 3

Content Heterogeneity

Data:

<paper>
  <year>2003</year>
  <title>The history of histograms (abridged)</title>
  <author>Yannis Ioannidis</author>
  <abstract>The history of histograms is long and rich, full of detailed information in every step. It...</abstract>
</paper>

Queries:

//paper[year>2000][author contains “Ioannidis”]//abstract[ftcontains histograms,history]

Value types: Numerical (year), String (author), Text (abstract)

Predicate types: Range (year>2000), Substring (author contains), Term Containment (ftcontains)

Page 4

Synopses and Heterogeneity

Mixed predicates => Unified summarization model
- Path structure
- Values of different types
- Correlations between and across

Summarization for textual values

[Figure: the query //paper[year>2000][author contains “Ioannidis”]//abstract[ftcontains histograms,history] posed against the XML data and against its synopsis.]

Page 5

XCluster Synopses

Data synopses for heterogeneous XML content
- Unified summarization for path structure and numerical, string, and textual content
- Support for twig queries with mixed predicates

XCluster model <=> Element clustering
- Tight cluster <=> Similar structure and values
- Extensibility to other value types

Principled compression framework

Experimental results: high accuracy with low storage requirements

Page 6

Outline

- Preliminaries
- XCluster Model
- XCluster Compression
- Construction Algorithm
- Experimental Study

Page 7

Data and Query Model

- Tree data with heterogeneous value content
- Tree-pattern queries with XPath expressions
- Result: set of binding tuples

for $q0 in /, $q1 in $q0/p[y>1999], $q2 in $q1/t[contains(XML)], $q3 in $q1/ab[ftcontains(synopsis,data)]

[Figure: Data (a tree whose values are labeled Numerical, String, and Text) and Query (a tree pattern over q0, q1, q2, q3 with Range, Substring, and Term Containment predicates).]

Page 8

Problem Definition

Problem: build a data synopsis that can estimate the selectivity of any query

Challenges:
- Heterogeneity of content
- Data correlations

Page 9

XCluster Model

Page 10

Structural Summarization

- Node <=> Elements of same tag
- Statistical information: node- and edge-counts
  - Node-count: number of elements in cluster
  - Edge-count: average number of children

[Figure: XML data tree (Data) next to its XCluster synopsis (XCluster).]
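The node- and edge-counts above can be illustrated with a minimal sketch that uses the coarsest possible clustering, one node per tag. The function name `tag_summary` and the sample document are hypothetical and not taken from the slides; an actual XCluster may keep several nodes per tag.

```python
from collections import defaultdict
import xml.etree.ElementTree as ET

def tag_summary(xml_text):
    """Group elements by tag (assumption: one synopsis node per tag).

    Returns (node_count, edge_count) where
      node_count[tag]            = number of elements with that tag
      edge_count[(parent, child)] = average number of `child` children
                                    per `parent` element
    """
    root = ET.fromstring(xml_text)
    node_count = defaultdict(int)
    child_total = defaultdict(int)  # (parent tag, child tag) -> total children
    for elem in root.iter():
        node_count[elem.tag] += 1
        for child in elem:
            child_total[(elem.tag, child.tag)] += 1
    edge_count = {
        edge: total / node_count[edge[0]]
        for edge, total in child_total.items()
    }
    return dict(node_count), edge_count

# Hypothetical sample document.
doc = "<db><paper><author/><author/></paper><paper><author/></paper></db>"
node_count, edge_count = tag_summary(doc)
```

In this sample, the two paper elements have three author children in total, so the paper-to-author edge-count is an average of 1.5.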

Page 11

Value Summarization

Value summary => Fractional value distribution
- Single-dimensional
- Approximation method depends on value type

[Figure: XML data tree (Data) next to its XCluster synopsis (XCluster).]

Page 12

Types of Value Summaries

- Numerical Content => Histograms
- String Content => Pruned Suffix Tries
- Text Content => End-biased Term Histograms

Text:

“The history of histograms is long and rich, full of detailed information in every step. It...”

Term Matrix:

Term             Freq
0 (history)      2
1 (histogram)    7
2 (data)         6
3 (database)     5
4 (information)  3
5 (value)        2

Term Histogram:

Bucket  Freq
010000  7
001000  6
000100  5
100011  7/3
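The step from term frequencies to an end-biased term histogram can be sketched as follows, assuming the histogram keeps the k most frequent terms as singleton buckets and replaces the rest with one uniform bucket holding their average frequency. The function name and the tie-breaking rule are illustrative choices; the frequencies are the ones listed on this slide.

```python
from fractions import Fraction

def end_biased_histogram(term_freq, k):
    """Keep the k most frequent terms exactly; average the remaining ones.

    Returns (singletons, uniform_terms, uniform_freq). Ties are broken
    alphabetically, which is an arbitrary choice made for this sketch.
    """
    ranked = sorted(term_freq.items(), key=lambda kv: (-kv[1], kv[0]))
    singletons = dict(ranked[:k])
    rest = ranked[k:]
    uniform_terms = sorted(term for term, _ in rest)
    if rest:
        uniform_freq = Fraction(sum(freq for _, freq in rest), len(rest))
    else:
        uniform_freq = Fraction(0)
    return singletons, uniform_terms, uniform_freq

# Term frequencies from the slide.
freqs = {"history": 2, "histogram": 7, "data": 6,
         "database": 5, "information": 3, "value": 2}
singletons, uniform_terms, uniform_freq = end_biased_histogram(freqs, 3)
```

With k = 3 the uniform bucket holds history, information, and value with average frequency 7/3, matching the last row of the slide's term histogram.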

Page 13

XCluster Model

- A node aggregates information about its elements
- Correspondence to clustering: node <=> cluster <=> centroid element
- Basic assumptions: independence and uniformity
  - Tight clusters => Valid assumptions

Each element in A has:
- 2 children in B
- 3 children in C
- value x with prob. 70%
- value y with prob. 30%

Page 14

Estimation Example

Two-step estimation algorithm:
- Identify embeddings
- Estimate selectivity of each embedding

Accuracy depends on “tightness” of centroids

sel(Q) = (1) * (2) * (1*st) * (1/2*sk)

[Figure: XCluster, Query, and Embedding. The embedding is annotated with 1 element, 2 children, 1*st children, and 1/2*sk children.]
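The product in the example can be reproduced with a small sketch. It assumes that st and sk are the fractions of elements satisfying the value predicates at two query nodes (the slide does not define them), and that independence lets the per-edge terms be multiplied. The function names and the numeric values chosen for st and sk are hypothetical.

```python
from math import prod

def embedding_selectivity(root_count, steps):
    """Estimate the number of binding tuples for one embedding.

    root_count: node-count of the synopsis node matched by the query root.
    steps: list of (edge_count, predicate_fraction) pairs, one per query
           edge; predicate_fraction is 1.0 when the target has no predicate.
    Assumes independence, so all factors are simply multiplied.
    """
    return root_count * prod(edge * frac for edge, frac in steps)

def query_selectivity(embeddings):
    """Second step of the algorithm: sum the estimates of all embeddings."""
    return sum(embedding_selectivity(count, steps) for count, steps in embeddings)

st, sk = 0.5, 0.2  # hypothetical predicate fractions
# Mirrors sel(Q) = (1) * (2) * (1*st) * (1/2*sk) from the slide.
estimate = query_selectivity([(1, [(2, 1.0), (1, st), (0.5, sk)])])
```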

Page 15

XCluster Compression

Page 16

Structural Compression

- Merge two nodes of same tag
- New node acquires aggregate characteristics
  - Node- and edge-counts are aggregated
  - Value summaries are “fused”
- Conceptually equivalent to cluster merging
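A merge of this kind might look like the sketch below. It assumes node-counts are summed while edge-counts and value fractions are averaged with the node-counts as weights; the slide only says the statistics are aggregated and the summaries fused, so the weighting is an assumption. The `Node` class is hypothetical, and a plain value-to-fraction dictionary stands in for the real histogram, suffix-trie, or term-histogram summaries.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    tag: str
    count: int                                  # node-count
    edges: dict = field(default_factory=dict)   # child node -> avg children per element
    values: dict = field(default_factory=dict)  # value -> fraction of elements

def merge(u, v):
    """Merge two same-tag nodes into one (count-weighted averages, an assumption)."""
    if u.tag != v.tag:
        raise ValueError("only nodes of the same tag can be merged")
    total = u.count + v.count

    def weighted(a, b):
        keys = set(a) | set(b)
        return {k: (u.count * a.get(k, 0.0) + v.count * b.get(k, 0.0)) / total
                for k in keys}

    return Node(u.tag, total, weighted(u.edges, v.edges), weighted(u.values, v.values))

# u uses the figures from the XCluster Model slide; v is hypothetical.
u = Node("A", 10, {"B": 2.0, "C": 3.0}, {"x": 0.7, "y": 0.3})
v = Node("A", 30, {"B": 4.0}, {"x": 0.1, "y": 0.9})
w = merge(u, v)
```

The merged node describes an "average" element of both clusters, which is why merging dissimilar nodes loosens the centroid and costs accuracy.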

Page 17

Value-Based Compression

- Reduce the storage of a single value summary
- Specifics depend on type of summary
  - Histogram: merge k buckets
  - Pruned Suffix Trie: prune k nodes
    - Remove leaf nodes based on statistical independence
  - Term Histogram: move k terms to the uniform bucket

Page 18

Compression vs. Accuracy

- Δ(S,S’): difference in accuracy between S and S’
- Key idea: apply operations with low Δ(S,S’)
- Absolute vs. Relative metric

[Figure: original XCluster S and compressed XCluster S’ in two panels, Absolute and Relative; one panel also includes a point R.]

Page 19

Distance Metric Δ(S,S’)

μ-query => basic query involving structure+values
- u[s]/c: the number of children in c per element in u that satisfies value predicate s
- Intuition: capture centroid information pertaining to c and s

Δ(S,S’): difference of estimates for μ-queries

\Delta(S,S') = \sum_{s,c} |u| \, (u[s]/c - w[s]/c)^2 + \sum_{s,c} |v| \, (v[s]/c - w[s]/c)^2

[Figure: synopsis S next to compressed synopsis S’.]
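A numeric sketch of a metric of this form follows. It assumes that u and v are nodes of S replaced by a single node w in S’, that each node's squared differences are weighted by its node-count, and that μ-query estimates are stored in a dictionary keyed by (s, c). All three points are readings of the slide, not confirmed details, and the function name `delta` and the sample numbers are hypothetical.

```python
def delta(u, v, w):
    """Sum of count-weighted squared differences of mu-query estimates.

    Each argument is (node_count, stats), where stats maps a pair
    (value predicate s, child node c) to the estimate u[s]/c, i.e. the
    number of children in c per element that satisfies s.
    Missing pairs are treated as 0 (an assumption of this sketch).
    """
    _, w_stats = w

    def part(node):
        count, stats = node
        keys = set(stats) | set(w_stats)
        return sum(count * (stats.get(k, 0.0) - w_stats.get(k, 0.0)) ** 2
                   for k in keys)

    return part(u) + part(v)

# Hypothetical nodes: u and v are merged into w.
u = (10, {("x", "B"): 2.0})
v = (30, {("x", "B"): 4.0})
w = (40, {("x", "B"): 3.5})
distance = delta(u, v, w)  # 10 * 1.5**2 + 30 * 0.5**2
```

A small distance means that w answers μ-queries almost as u and v did, so the merge costs little accuracy.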

Page 20

XCluster Construction

Page 21

XCluster Construction

- Step 1: Build reference synopsis
  - Count stability + Detailed value summaries
- Step 2: Compress structural information
- Step 3: Compress value-based information

[Figure: XML Data -> (Step 1) Reference Summary -> (Step 2) XCluster with detailed value distributions -> (Step 3) XCluster.]

Page 22

Structural Compression

Algorithm sketch:
1. Generate pool of candidate merge operations
2. Apply operations in increasing order of Δ(S,S’)
3. Repeat until size < budget

A-priori generation of candidates
- Merges at level l trigger merges at level l-1
- Adaptive, leaf-to-root merging of nodes

[Figure: construction pipeline (XML Data, Reference Summary, XCluster with detailed value distributions, XCluster; Steps 1 to 3).]
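The greedy loop in the algorithm sketch can be illustrated under strong simplifications: each node carries a single average child count, synopsis size is measured as the number of nodes, and the candidate pool is recomputed from scratch in every round. The level-wise, leaf-to-root candidate generation from the slide is not modeled. All names and sample data are hypothetical.

```python
from itertools import combinations

def merge(u, v):
    """Merge two same-tag nodes; the child count is a count-weighted average."""
    total = u["count"] + v["count"]
    kids = (u["count"] * u["kids"] + v["count"] * v["kids"]) / total
    return {"tag": u["tag"], "count": total, "kids": kids}

def delta(u, v):
    """Count-weighted squared change in the child-count estimate (simplified)."""
    w = merge(u, v)
    return (u["count"] * (u["kids"] - w["kids"]) ** 2
            + v["count"] * (v["kids"] - w["kids"]) ** 2)

def compress(nodes, budget):
    """Apply the cheapest same-tag merge until size < budget or no merge is left."""
    nodes = list(nodes)
    while len(nodes) >= budget:
        candidates = [(delta(nodes[i], nodes[j]), i, j)
                      for i, j in combinations(range(len(nodes)), 2)
                      if nodes[i]["tag"] == nodes[j]["tag"]]
        if not candidates:
            break  # nothing left to merge
        _, i, j = min(candidates)
        merged = merge(nodes[i], nodes[j])
        nodes = [n for k, n in enumerate(nodes) if k not in (i, j)] + [merged]
    return nodes

nodes = [
    {"tag": "p", "count": 10, "kids": 2.0},
    {"tag": "p", "count": 10, "kids": 2.2},
    {"tag": "p", "count": 5, "kids": 9.0},
    {"tag": "t", "count": 7, "kids": 0.0},
]
result = compress(nodes, 4)
```

In this sample the two p nodes with similar child counts are merged first, while the outlier p node with 9 children per element stays separate.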

Page 23

Value-Based Compression

Algorithm sketch:
1. Generate one operation for each value summary
2. Apply value compression with least Δ(S,S’)
3. Repeat until size < budget

Generate operations of “least effect”:
- Histograms: merge buckets with least difference
- PSTs: prune leaves with max independence
- Term Histograms: remove singletons of least freq.

[Figure: construction pipeline (XML Data, Reference Summary, XCluster with detailed value distributions, XCluster; Steps 1 to 3).]
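For the histogram case, a "least effect" operation could be sketched as below. It assumes that "least difference" means the smallest gap in density (count per unit of range) between adjacent buckets; the slide does not spell out the criterion. The bucket layout and function name are hypothetical.

```python
def merge_closest_buckets(buckets):
    """Merge the adjacent pair of buckets whose densities differ least.

    buckets: list of (lo, hi, count) tuples sorted by range, hi exclusive.
    Returns a new list with one bucket fewer (unchanged if fewer than two).
    """
    if len(buckets) < 2:
        return list(buckets)

    def density(bucket):
        lo, hi, count = bucket
        return count / (hi - lo)

    i = min(range(len(buckets) - 1),
            key=lambda k: abs(density(buckets[k]) - density(buckets[k + 1])))
    lo, _, left_count = buckets[i]
    _, hi, right_count = buckets[i + 1]
    return buckets[:i] + [(lo, hi, left_count + right_count)] + buckets[i + 2:]

# Hypothetical histogram over publication years.
hist = [(1990, 1995, 50), (1995, 2000, 52), (2000, 2005, 200)]
smaller = merge_closest_buckets(hist)
```

The first two buckets have nearly equal densities, so merging them changes range-predicate estimates the least.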

Page 24

Experimental Study

Page 25

Methodology

Data sets:

Data set  #Elements  #Value Paths  Ref. Size (KB)
XMark     206130     9             869
IMDB      236822     7             462

Workloads: random twig queries
- Structure only and with predicates
- Biased toward high selectivities

Metrics:
- Absolute relative error: |true-estim|/max(true,s)
- Absolute error: |true-estim|
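The two metrics can be written out directly. The slide does not define s; this sketch treats it as a lower bound on the denominator that keeps the relative error from exploding for queries with very small true counts. The function names and sample numbers are illustrative.

```python
def absolute_error(true, estim):
    """|true - estim|"""
    return abs(true - estim)

def absolute_relative_error(true, estim, s):
    """|true - estim| / max(true, s); s is assumed to bound the denominator from below."""
    return abs(true - estim) / max(true, s)

# Hypothetical counts: one high-selectivity query and one low-selectivity query.
high = absolute_relative_error(200, 150, 10)  # denominator is the true count
low = absolute_relative_error(2, 5, 10)       # denominator is s
```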

Page 26

Accuracy of XClusters

[Figure: IMDB data set. x-axis: Synopsis Size (KB), 150 to 200; y-axis: Estimation Error (%), 0 to 90; series: Overall, Struct, Numeric, String, Text.]

Page 27

XCluster vs. TreeSketch

[Figure: XMark data set. x-axis: Synopsis Size (KB), 0 to 50; y-axis: Estimation Error (%), 0 to 50; series: XCluster, TreeSketch.]

Page 28

Conclusions

- XML synopses are essential for XML query optimization
- Our contribution: XCluster Synopses
  - XML summaries for heterogeneous content
  - Support for twig queries with numerical, string, and textual predicates
- XCluster model: generalized element clustering
- Principled construction algorithm
- Experimental results: high accuracy with low storage requirements