XCluster Synopses for Structured XML Content

Alkis Polyzotis (UC Santa Cruz)

Minos Garofalakis (Intel Research, Berkeley)

XML Summarization

Synopses are essential for XML data management
  - Statistics for XML query optimization
  - Approximate query answering

Active research topic in the field of XML databases
  - Markov Tables, XSketch, XPathLearner, CSTs, TreeSketch, ...

[Figure: count(Q) evaluated on the XML Data gives the selectivity of Q; count(Q) evaluated on the Synopsis gives the estimated selectivity of Q.]

Content Heterogeneity

Data:

<paper><year>2003</year><title>The history of histograms (abridged)</title><author>Yannis Ioannidis</author><abstract>The history of histograms is long and rich, full of detailed information in every step. It...</abstract></paper>

Queries:

//paper[year>2000][author contains “Ioannidis”]//abstract[ftcontains histograms,history]

Value type   Predicate type
Numerical    Range
String       Substring
Text         Term Containment

Synopses and Heterogeneity

Mixed predicates => Unified summarization model
  - Path structure
  - Values of different types
  - Correlations between and across path structure and values

Summarization for textual values

//paper[year>2000][author contains “Ioannidis”]//abstract[ftcontains histograms,history]

[Figure: XML Data and its Synopsis.]

XCluster Synopses

Data synopses for heterogeneous XML content
  - Unified summarization for path structure and numerical, string, and textual content
  - Support for twig queries with mixed predicates

XCluster model <=> Element clustering
  - Tight cluster <=> Similar structure and values
  - Extensibility to other value types

Principled compression framework

Experimental results: high accuracy with low storage requirements

Outline

Preliminaries
XCluster Model
XCluster Compression
Construction Algorithm
Experimental Study

Data and Query Model

Tree data with heterogeneous value content

Tree-pattern queries with XPath expressions
  - Result: set of binding tuples

for $q0 in /, $q1 in $q0/p[y>1999], $q2 in $q1/t[contains(XML)], $q3 in $q1/ab[ftcontains(synopsis,data)]

[Figure: Data tree and Query tree with nodes q0, q1, q2, q3. Value types Numerical, String, and Text are paired with Range, Substring, and Term Containment predicates.]

Problem Definition

Problem: build a data synopsis that can estimate the selectivity of any query

Challenges:
  - Heterogeneity of content
  - Data correlations

[Figure: Synopsis.]

XCluster Model

Structural Summarization

Node <=> Elements of same tag

Statistical information: node- and edge-counts
  - Node-count: number of elements in cluster
  - Edge-count: average number of children

[Figure: Data and the corresponding XCluster.]
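
As a minimal sketch of these statistics, the following Python computes node-counts and edge-counts for the coarsest possible clustering, with one node per tag. The nested `(tag, children)` tuple encoding and the `summarize` helper are illustrative assumptions, not part of the XCluster construction itself.

```python
from collections import defaultdict

def summarize(root):
    """Group elements by tag and return (node_counts, edge_counts).

    `root` is a hypothetical (tag, [children]) encoding of an XML tree.
    edge_counts[(parent_tag, child_tag)] is the average number of
    child_tag children per parent_tag element.
    """
    node_counts = defaultdict(int)
    child_totals = defaultdict(int)
    stack = [root]
    while stack:
        tag, children = stack.pop()
        node_counts[tag] += 1
        for child in children:
            child_totals[(tag, child[0])] += 1
            stack.append(child)
    edge_counts = {
        (parent, child): total / node_counts[parent]
        for (parent, child), total in child_totals.items()
    }
    return dict(node_counts), edge_counts

# One A element with two B children and three C children.
doc = ("A", [("B", []), ("B", []), ("C", []), ("C", []), ("C", [])])
node_counts, edge_counts = summarize(doc)
print(sorted(node_counts.items()))  # [('A', 1), ('B', 2), ('C', 3)]
print(edge_counts)                  # {('A', 'B'): 2.0, ('A', 'C'): 3.0}
```

A real synopsis may keep several nodes per tag; this sketch only shows what the two counts measure.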

Value Summarization

Value summary => Fractional value distribution
  - Single-dimensional
  - Approximation method depends on value type

[Figure: Data and the corresponding XCluster.]

Types of Value Summaries

Numerical Content => Histograms
String Content => Pruned Suffix Tries
Text Content => End-biased Term Histograms

Text:

“The history of histograms is long and rich, full of detailed information in every step. It...”

Term Matrix:

Term             Freq
0 (history)      2
1 (histogram)    7
2 (data)         6
3 (database)     5
4 (information)  3
5 (value)        2

Term Histogram:

Bucket  Freq
010000  7
001000  6
000100  5
100011  7/3
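
The term histogram in this example can be reproduced with a short sketch. It assumes that "end-biased" means the most frequent terms keep exact singleton buckets while the remaining terms share one uniform bucket holding their average frequency. The function name `end_biased` and its parameters are hypothetical.

```python
from fractions import Fraction

def end_biased(term_freqs, num_singletons):
    """Keep the `num_singletons` most frequent terms exactly; average the rest."""
    ranked = sorted(term_freqs.items(), key=lambda kv: kv[1], reverse=True)
    singletons = dict(ranked[:num_singletons])
    rest = ranked[num_singletons:]
    uniform_terms = sorted(term for term, _ in rest)
    # One shared frequency for every term in the uniform bucket.
    uniform_freq = Fraction(sum(f for _, f in rest), len(rest)) if rest else Fraction(0)
    return singletons, uniform_terms, uniform_freq

freqs = {"history": 2, "histogram": 7, "data": 6,
         "database": 5, "information": 3, "value": 2}
singletons, uniform_terms, uniform_freq = end_biased(freqs, 3)
print(singletons)     # {'histogram': 7, 'data': 6, 'database': 5}
print(uniform_terms)  # ['history', 'information', 'value']
print(uniform_freq)   # 7/3
```

The uniform bucket corresponds to the bitmap 100011 (terms 0, 4, and 5), whose frequencies 2, 3, and 2 average to 7/3.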

XCluster Model

A node aggregates information about its elements

Correspondence to clustering: node <=> cluster <=> centroid element

Basic assumptions: independence and uniformity
  - Tight clusters => Valid assumptions

Each element in A has:
  - 2 children in B
  - 3 children in C
  - value x with prob. 70%
  - value y with prob. 30%

Estimation Example

Two-step estimation algorithm:
  1. Identify embeddings
  2. Estimate selectivity of each embedding

Accuracy depends on “tightness” of centroids

[Figure: an XCluster, a Query, and one Embedding of the query in the XCluster. Factors along the embedding: 1 element, 2 children, 1*st children, 1/2*sk children.]

sel(Q) = (1) * (2) * (1*st) * (1/2*sk)
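
The product in this example can be written as a small sketch. It assumes that each step of an embedding contributes an edge-count multiplied by the selectivity of the value predicate at the target node, and that `st` and `sk` denote such predicate selectivities. The numeric values below are hypothetical and chosen only to make the arithmetic concrete.

```python
from math import prod

def embedding_selectivity(root_count, steps):
    """Estimate for one embedding.

    `steps` is a list of (edge_count, predicate_selectivity) pairs, one per
    query edge; use 1.0 as the selectivity when a node has no predicate.
    """
    return root_count * prod(edge * sel for edge, sel in steps)

def query_selectivity(embeddings):
    """Sum the estimates of all embeddings of the query."""
    return sum(embedding_selectivity(root, steps) for root, steps in embeddings)

st, sk = 0.5, 0.4  # hypothetical predicate selectivities
one_embedding = (1, [(2, 1.0), (1, st), (0.5, sk)])
sel = query_selectivity([one_embedding])
print(sel)  # 1 * 2 * (1 * 0.5) * (0.5 * 0.4)
```

Multiplying the factors relies on the independence and uniformity assumptions, which is why accuracy depends on how tight the clusters are.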

XCluster Compression

Structural Compression

Merge two nodes of same tag

New node acquires aggregate characteristics
  - Node- and edge-counts are aggregated
  - Value summaries are “fused”

Conceptually equivalent to cluster merging
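
A minimal sketch of such a merge follows. It assumes that node-counts add up and that edge-counts, being per-element averages, combine as count-weighted averages. The dict layout and the `merge_nodes` name are hypothetical, and the fusing of value summaries is left out.

```python
def merge_nodes(u, v):
    """Merge two same-tag nodes given as {'count': int, 'edges': {child: avg}}."""
    total = u["count"] + v["count"]
    children = set(u["edges"]) | set(v["edges"])
    edges = {
        c: (u["count"] * u["edges"].get(c, 0.0)
            + v["count"] * v["edges"].get(c, 0.0)) / total
        for c in children
    }
    return {"count": total, "edges": edges}

u = {"count": 10, "edges": {"B": 2.0}}
v = {"count": 30, "edges": {"B": 4.0, "C": 1.0}}
w = merge_nodes(u, v)
print(w["count"])       # 40
print(w["edges"]["B"])  # 3.5
print(w["edges"]["C"])  # 0.75
```

The merged node plays the role of the centroid of the combined cluster: its averages are exact for the union, but per-element detail from `u` and `v` is lost.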

Value-Based Compression

Reduce the storage of a single value summary

Specifics depend on type of summary
  - Histogram: merge k buckets
  - Pruned Suffix Trie: prune k nodes (remove leaf nodes based on statistical independence)
  - Term Histogram: move k terms to the uniform bucket

Compression vs. Accuracy

Δ(S,S’): difference in accuracy between S and S’

Key idea: apply operations with low Δ(S,S’)

Absolute vs. Relative metric

[Figure: Original XCluster S and Compressed XCluster S’, compared under the Absolute metric (S, S’) and the Relative metric (S, S’, R).]

Distance Metric Δ(S,S’)

μ-query => basic query involving structure+values
  - u[s]/c: the number of children in c per element in u that satisfies value predicate s
  - Intuition: capture centroid information pertaining to c and s

Δ(S,S’): difference of estimates for μ-queries

$$
\Delta(S,S') = |u| \sum_{s,c} \bigl(u[s]/c - w[s]/c\bigr)^2 + |v| \sum_{s,c} \bigl(v[s]/c - w[s]/c\bigr)^2
$$

[Figure: synopsis S and compressed synopsis S’.]
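
The distance can be sketched in a few lines, reading the formula as a sum of squared differences between μ-query estimates, weighted by the leading factors for u and v. The sketch assumes those factors are the node-counts of u and v, and that w is the node of S’ that replaces them. The `merge_distance` name and the dict encoding of estimates are hypothetical.

```python
def merge_distance(u_count, u_est, v_count, v_est, w_est):
    """Delta for replacing nodes u and v by w.

    Each *_est maps a (predicate s, child c) pair to the estimate x[s]/c.
    Missing pairs are treated as 0.0 (an assumption of this sketch).
    """
    keys = set(u_est) | set(v_est) | set(w_est)

    def term(count, est):
        return count * sum(
            (est.get(k, 0.0) - w_est.get(k, 0.0)) ** 2 for k in keys
        )

    return term(u_count, u_est) + term(v_count, v_est)

key = ("year>2000", "author")
delta = merge_distance(10, {key: 2.0}, 30, {key: 4.0}, {key: 3.5})
print(delta)  # 10 * 1.5**2 + 30 * 0.5**2 = 30.0
```

A small result means the merged node answers the μ-queries almost as well as the two original nodes did, so the merge is a cheap way to save space.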

XCluster Construction

Step 1: Build reference synopsis
  - Count stability + Detailed value summaries

Step 2: Compress structural information

Step 3: Compress value-based information

[Figure: XML Data => (Step 1) Reference Summary => (Step 2) XCluster with detailed value distributions => (Step 3) XCluster]

Structural Compression

Algorithm sketch:
  1. Generate pool of candidate merge operations
  2. Apply operations in increasing order of Δ(S,S’)
  3. Repeat until size < budget

A-priori generation of candidates
  - Merges at level l trigger merges at level l-1
  - Adaptive, leaf-to-root merging of nodes

[Figure: XML Data => Reference Summary]
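
The greedy loop can be illustrated with a deliberately simplified sketch. It makes several assumptions that are not in the slides: each node carries a single average fanout instead of full edge-counts and value summaries, size is measured as the number of nodes, the candidate pool is every same-tag pair and is rebuilt each round, and Δ is a count-weighted squared deviation from the merged fanout. All names are hypothetical.

```python
from itertools import combinations

def merged(u, v):
    """Merge two same-tag nodes; fanout becomes the count-weighted average."""
    total = u["count"] + v["count"]
    fanout = (u["count"] * u["fanout"] + v["count"] * v["fanout"]) / total
    return {"tag": u["tag"], "count": total, "fanout": fanout}

def delta(u, v):
    """Simplified distance: weighted squared deviation from the merged node."""
    w = merged(u, v)
    return (u["count"] * (u["fanout"] - w["fanout"]) ** 2
            + v["count"] * (v["fanout"] - w["fanout"]) ** 2)

def compress(nodes, budget):
    """Apply the cheapest same-tag merge until len(nodes) < budget."""
    nodes = list(nodes)
    while len(nodes) >= budget:
        pairs = [(i, j) for i, j in combinations(range(len(nodes)), 2)
                 if nodes[i]["tag"] == nodes[j]["tag"]]
        if not pairs:
            break  # nothing left to merge
        i, j = min(pairs, key=lambda p: delta(nodes[p[0]], nodes[p[1]]))
        w = merged(nodes[i], nodes[j])
        nodes = [n for k, n in enumerate(nodes) if k not in (i, j)] + [w]
    return nodes

nodes = [
    {"tag": "p", "count": 10, "fanout": 2.0},
    {"tag": "p", "count": 10, "fanout": 2.2},
    {"tag": "p", "count": 5, "fanout": 9.0},
    {"tag": "t", "count": 8, "fanout": 1.0},
]
result = compress(nodes, budget=4)
print(len(result))  # 3
```

Here the two similar `p` nodes are merged first, while the outlier with fanout 9.0 stays separate, which is the behavior a low-Δ-first policy is meant to produce.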



Value-Based Compression

Algorithm sketch:
  1. Generate one operation for each value summary
  2. Apply value compression with least Δ(S,S’)
  3. Repeat until size < budget

Generate operations of “least effect”:
  - Histograms: merge buckets with least difference
  - PSTs: prune leaves with max independence
  - Term Histograms: remove singletons of least freq.

[Figure: XML Data => Reference Summary]
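
For the histogram case, one "least effect" operation might look like the sketch below. It assumes that "least difference" refers to the frequency density (count per unit of range) of adjacent buckets; the slides do not spell out the measure. The `(lo, hi, freq)` bucket encoding and the function name are hypothetical.

```python
def merge_closest_buckets(buckets):
    """Merge the adjacent pair of buckets whose densities differ the least.

    `buckets` is a list of (lo, hi, freq) tuples sorted by range, where
    freq is the number of values falling in [lo, hi).
    """
    if len(buckets) < 2:
        return list(buckets)

    def density(bucket):
        lo, hi, freq = bucket
        return freq / (hi - lo)

    i = min(
        range(len(buckets) - 1),
        key=lambda k: abs(density(buckets[k]) - density(buckets[k + 1])),
    )
    lo, _, f1 = buckets[i]
    _, hi, f2 = buckets[i + 1]
    return buckets[:i] + [(lo, hi, f1 + f2)] + buckets[i + 2:]

hist = [(1990, 1995, 10), (1995, 2000, 12), (2000, 2005, 40)]
print(merge_closest_buckets(hist))  # [(1990, 2000, 22), (2000, 2005, 40)]
```

The first two buckets have densities 2.0 and 2.4, so merging them changes range estimates far less than merging either with the dense third bucket.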



Experimental Study

Methodology

Data sets:

Data set  #Elements  #Value Paths  Ref. Size (KB)
XMark     206130     9             869
IMDB      236822     7             462

Workloads: random twig queries
  - Structure only and with predicates
  - Biased toward high selectivities

Metrics:
  - Absolute relative error: |true-estim|/max(true,s)
  - Absolute error: |true-estim|
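
The two error metrics from the methodology translate directly into code. The sketch assumes that `s` is a lower bound on the denominator that keeps queries with very small true counts from dominating the relative error; the slides give the formula but do not define `s`.

```python
def absolute_error(true, estim):
    """|true - estim|"""
    return abs(true - estim)

def absolute_relative_error(true, estim, s):
    """|true - estim| / max(true, s), with s assumed to be a lower bound."""
    return abs(true - estim) / max(true, s)

print(absolute_error(100, 80))               # 20
print(absolute_relative_error(100, 80, 10))  # 0.2
print(absolute_relative_error(2, 6, 10))     # 0.4
```

In the last call the true count (2) is below `s`, so the error is divided by 10 instead of 2.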

Accuracy of XClusters

[Figure: IMDB. Estimation Error (%), 0 to 90, vs. Synopsis Size (KB), 150 to 200. Series: Overall, Struct, Numeric, String, Text.]

XCluster vs. TreeSketch

[Figure: XMark. Estimation Error (%), 0 to 50, vs. Synopsis Size (KB), 0 to 50. Series: XCluster, TreeSketch.]

Conclusions

XML synopses are essential for XML query optimization

Our contribution: XCluster Synopses
  - XML summaries for heterogeneous content
  - Support for twig queries with numerical, string, and textual predicates
  - XCluster model: generalized element clustering
  - Principled construction algorithm
  - Experimental results: high accuracy with low storage requirements
