Lecture 8: XML Compression (COMP9319)
2
Semistructured Data / XML
• Semistructured => loosely structured (no restrictions on tags & nesting relationships); no schema required
• XML falls under the "semistructured" umbrella: self-describing; the standard for information representation & exchange
3
An XML data file can be modeled in tree form
<Staff> <Name> <FirstName> Raymond </FirstName> <LastName> Wong </LastName> </Name> <Login> wong </Login> <Ext> 5932 </Ext></Staff>
Staff
  Name
    FirstName: "Raymond"
    LastName: "Wong"
  Login: "wong"
  Ext: "5932"
4
XPath evaluation
<a><b><c>12</c><d>7</d></b><b><c>7</c></b></a>
a
  b
    c: 12
    d: 7
  b
    c: 7
Query: /a/b[c = "12"]
<b><c>12</c><d>7</d></b>
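The evaluation above can be reproduced with a few lines of Python. This is only a sketch using the limited XPath subset of the standard library's `xml.etree.ElementTree`; because the parsed root is already the `a` element, the slide's query is written relative to it as `./b[c='12']`.

```python
import xml.etree.ElementTree as ET

# The document from the slide.
doc = "<a><b><c>12</c><d>7</d></b><b><c>7</c></b></a>"
root = ET.fromstring(doc)  # root is the <a> element

# /a/b[c = "12"], expressed relative to <a>: keep the b children
# that have a c child whose text is exactly "12".
matches = root.findall("./b[c='12']")
result = [ET.tostring(m, encoding="unicode") for m in matches]
print(result)
```

Only the first `b` qualifies, since the second `b` has `c = 7`.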
8
Path indexing
• Traversing the graph/tree is almost the same as query processing for semistructured / XML data
• Normally, a query requires traversing the data from the root and returning all nodes X reachable by a path matching the given regular path expression
• Motivation: an index allows the system to answer regular path expressions without traversing the whole graph/tree
9
Major criteria for indexing
• Speed up the search (by cutting the search space down)
• Relatively smaller in size than the original data graph/tree
• Easy to maintain (during data loading and during updates)
10
An Example of DAG Data
[Figure: data graph with nodes root and o1 to o13; edge labels: dept, staff, support, member, name, phone]
11
Index graph based on language-equivalence
• A reduced graph that summarizes all paths from the root in the data graph
• The paths from root to o12:
  staff
  dept/member
  support/member
12
Language-equivalent nodes
• Let L(x) := {w | there is a path from the root to x labeled w}
• The set L(x) may be infinite when there are cycles
• Nodes x, y are language-equivalent (x ≡ y) if L(x) = L(y)
• We construct the index I by taking the nodes to be the equivalence classes for ≡
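The definition can be made concrete with a brute-force sketch. It assumes an acyclic graph, so every L(x) is finite, and simply enumerates all root paths; the small graph `EDGES` and its node names are hypothetical and are not the lecture's example graph.

```python
from collections import defaultdict

def path_languages(edges, root):
    """Return {node: set of label paths from root}; assumes the graph is acyclic."""
    children = defaultdict(list)
    for src, label, dst in edges:
        children[src].append((label, dst))
    langs = defaultdict(set)

    def walk(node, path):
        langs[node].add(path)
        for label, child in children[node]:
            walk(child, path + (label,))

    walk(root, ())
    return langs

def equivalence_classes(edges, root):
    """Group nodes x, y together when L(x) == L(y)."""
    groups = defaultdict(list)
    for node, lang in path_languages(edges, root).items():
        groups[frozenset(lang)].append(node)
    return sorted(sorted(g) for g in groups.values())

# Hypothetical DAG: s1 and s2 are both reachable via staff and dept/member.
EDGES = [
    ("root", "dept", "d1"), ("root", "dept", "d2"),
    ("root", "staff", "s1"), ("root", "staff", "s2"),
    ("d1", "member", "s1"), ("d2", "member", "s2"),
]
classes = equivalence_classes(EDGES, "root")
print(classes)
```

Enumerating every path is exponential in the worst case and does not terminate on cyclic data, which is one way to see why constructing this index is expensive in general.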
13
Language-equivalent
• The paths from root to o3:
  staff
  dept/member
• Paths to o4 happen to be exactly the same 2 sequences
• Same for o8 and o12
• Equivalence class: {o3, o4, o8, o12}
14
Equivalence classes
[Figure: the data graph (nodes root, o1 to o13; edge labels dept, staff, support, member, name, phone) with its nodes grouped into equivalence classes]
• {o1, o2, o7}
• {o3, o4, o8, o12}
• {o12, o13}
• {o5, o6, o9}
• {o10}
• {o11}
15
The index graph
[Figure: index graph with one node per equivalence class: root; {o1, o2, o7}; {o3, o4, o8, o12}; {o12, o13}; {o5, o6, o9}; {o10}; {o11}. Edge labels: dept, staff, support, member, name, phone]
16
Query processing based on the index graph
[Figure: the index graph from the previous slide]
dept/member/(name | phone)
-> dept/member/name UNION dept/member/phone
-> {o5, o6, o9} UNION {o10}
-> {o5, o6, o9, o10}
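The rewriting above amounts to walking the index graph and taking the union of the extents reached. A minimal sketch follows; the index-node names I1 to I6 and the exact edge layout in `INDEX_EDGES` are assumptions (the figure's edges are not recoverable from the transcript), chosen only so that they agree with the extents and the query result shown on the slide.

```python
# Assumed index graph: node -> {label: [target index nodes]}.
INDEX_EDGES = {
    "root": {"dept": ["I1"], "staff": ["I2"], "support": ["I3"]},
    "I1": {"member": ["I2"]},
    "I3": {"member": ["I4"]},
    "I2": {"name": ["I5"], "phone": ["I6"]},
}
# Extents (data nodes summarized by each index node), as listed on the slide.
EXTENT = {
    "I1": {"o1", "o2", "o7"},
    "I2": {"o3", "o4", "o8", "o12"},
    "I3": {"o11"},
    "I4": {"o12", "o13"},
    "I5": {"o5", "o6", "o9"},
    "I6": {"o10"},
}

def evaluate(steps):
    """Evaluate a path query; each step is a list of alternative labels."""
    current = {"root"}
    for alternatives in steps:
        reached = set()
        for node in current:
            for label in alternatives:
                reached.update(INDEX_EDGES.get(node, {}).get(label, []))
        current = reached
    result = set()
    for node in current:
        result |= EXTENT.get(node, set())
    return result

# dept/member/(name | phone)
answer = evaluate([["dept"], ["member"], ["name", "phone"]])
print(sorted(answer))
```

The data graph itself is never touched: only the (much smaller) index graph is traversed.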
17
About this indexing scheme
• The index graph is never larger than the data
• In practice, the index graph is small enough to fit in memory
• Constructing the index is, however, a problem:
  checking whether two nodes are language-equivalent is very expensive (PSPACE-complete)
  an approximation based on bisimulation exists
18
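A Data Guide
[Figure: data guide for the example graph. Extents: root; {o1, o2, o7}; {o3, o4, o8, o12}; {o11}; {o12, o13}; {o5, o6, o9}; {o10}. Edge labels: dept, staff, support, member, name, phone. The extents {o3, o4, o8, o12}, {o5, o6, o9} and {o10} each appear at more than one node]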
19
About Data Guide
• Unique labels at each node
• (Hence) extents are no longer disjoint
• Query processing proceeds as before
• The size of the index may be >= the data size
• Good for data that is regular & has no cycles
20
XML-Specific Compressors
• Unqueriable Compression (e.g. XMill):
  Full-chunked: data commonalities eliminated
  Very good compression ratio
• Queriable Compression (e.g. XGrind, XPRESS):
  Fine-grained: data commonalities ignored
  Inadequate compression ratio and time
  Support simple path queries with atomic predicate
21
XMill
• First specialized compressor for XML data
  SAX parser for parsing XML data
  Still using gzip as its underlying compressor
  Clever grouping of data into containers for compression
• Compresses XML via three basic techniques:
  Compress the structure separately from the data
  Group the data values according to their types
  Apply semantic (specialized) compressors
• Downloadable: www.cs.washington.edu/homes/suciu/XMILL
23
An Example: Web Server Logs
ASCII file, 15.9 MB (gzipped 1.6 MB):
202.239.238.16|GET / HTTP/1.0|text/html|200|1997/10/01-00:00:02|-|4478|-|-|http://www.net.jp/|Mozilla/3.1[ja](I)
XML-ized Apache web log inflates to 24.2 MB (gzipped 2.1 MB):
<apache:entry>
  <apache:host> 202.239.238.16 </apache:host>
  <apache:requestLine> GET / HTTP/1.0 </apache:requestLine>
  <apache:contentType> text/html </apache:contentType>
  <apache:statusCode> 200 </apache:statusCode>
  <apache:date> 1997/10/01-00:00:02 </apache:date>
  <apache:byteCount> 4478 </apache:byteCount>
  <apache:referer> http://www.net.jp/ </apache:referer>
  <apache:userAgent> Mozilla/3.1[ja](I) </apache:userAgent>
</apache:entry>
24
How XMill Works: Three Ideas
Idea 1. Compress the structure separately from the data:
• Structure: <apache:entry> <apache:host> </apache:host> . . . </apache:entry>
• Data: 202.239.238.16 GET / HTTP/1.0 text/html 200 …
• gzip(Structure) + gzip(Data) = 1.75 MB
25
How XMill Works: Three Ideas
Idea 2. Group the data values according to their types:
• Structure: <apache:entry> . . . </apache:entry>
• Data1: 202.23.23.16, 224.42.24.55, …
• Data2: GET / HTTP/1.0, GET / HTTP/1.1, …
• gzip(Structure) + gzip(Data1) + gzip(Data2) + ... = 1.33 MB
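The first two ideas can be sketched in Python as follows. This is not XMill's actual container format: the structure tokens (`<tag`, `#`, `/`), the choice of one container per enclosing tag, and the use of `zlib` in place of gzip are all simplifying assumptions, and the sample log uses un-prefixed tag names.

```python
import zlib
import xml.sax
from collections import defaultdict

class ContainerSplitter(xml.sax.ContentHandler):
    """Separate the tag structure from the text, grouping text by enclosing tag."""

    def __init__(self):
        super().__init__()
        self.structure = []                   # tag skeleton with '#' placeholders
        self.containers = defaultdict(list)   # tag name -> list of text values
        self.stack = []
        self.buffer = []

    def _flush(self):
        # characters() may deliver one text node in several chunks, so join first.
        text = "".join(self.buffer).strip()
        self.buffer = []
        if text and self.stack:
            self.containers[self.stack[-1]].append(text)
            self.structure.append("#")

    def startElement(self, name, attrs):
        self._flush()
        self.stack.append(name)
        self.structure.append("<" + name)

    def endElement(self, name):
        self._flush()
        self.stack.pop()
        self.structure.append("/")

    def characters(self, content):
        self.buffer.append(content)

def split_and_compress(xml_text):
    handler = ContainerSplitter()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    structure_blob = zlib.compress(" ".join(handler.structure).encode("utf-8"))
    container_blobs = {tag: zlib.compress("\n".join(values).encode("utf-8"))
                       for tag, values in handler.containers.items()}
    return handler, structure_blob, container_blobs

log = ("<log><entry><host>202.239.238.16</host><requestLine>GET / HTTP/1.0</requestLine></entry>"
       "<entry><host>224.42.24.55</host><requestLine>GET / HTTP/1.1</requestLine></entry></log>")
handler, structure_blob, container_blobs = split_and_compress(log)
print(dict(handler.containers))
```

Values of the same kind end up next to each other in one container, which is what lets a general-purpose compressor exploit their similarity.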
26
How XMill Works: Three Ideas
Idea 3. Apply semantic (specialized) compressors:
• gzip(Structure) + gzip(c1(Data1)) + gzip(c2(Data2)) + ... = 0.82 MB
Examples:
• 8, 16, 32-bit integer encoding (signed/unsigned)
• differential compressing (e.g. 1999, 1995, 2001, 2000, 1995, ...)
• compress lists, records (e.g. 104.32.23.1 => 4 bytes)
Need user input to select the semantic compressor
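Two of these semantic compressors are easy to illustrate. The sketch below is not XMill code; the function names are hypothetical, and it only shows the idea of packing a dotted IPv4 address into 4 bytes and of storing a numeric sequence as differences from the previous value.

```python
import ipaddress

def pack_ipv4(text):
    """Encode a dotted IPv4 address such as '104.32.23.1' in 4 bytes."""
    return ipaddress.IPv4Address(text).packed

def delta_encode(values):
    """Keep the first value, then store each value as a difference from its predecessor."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

def delta_decode(deltas):
    """Invert delta_encode by accumulating the differences."""
    values = []
    for d in deltas:
        values.append(d if not values else values[-1] + d)
    return values

packed = pack_ipv4("104.32.23.1")
deltas = delta_encode([1999, 1995, 2001, 2000, 1995])
print(len(packed), deltas)
```

The small differences need fewer bits than the full years, and both encodings are applied before the result is handed to the general-purpose compressor.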
31
XGRIND (Tolani & Haritsa, 2002)
• Encodes elements and attributes using XMill's approach
• DTD-conscious: enumerated attributes with k possible values are encoded using a log2 k-bit scheme
• Data values are encoded using non-adaptive Huffman coding
  Requires two passes over the input document
  Separate statistical model for each element/attribute
• Homomorphic compression: the compressed document retains the original structure
June 24, 2008 XML Compression Techniques 31
32
XGRIND
Original Fragment:
<student name="Alice">
  <a1>78</a1>
  <a2>86</a2>
  <midterm>91</midterm>
  <project>87</project>
</student>
Compressed Fragment:
T0 A0 nahuff(Alice)
T1 nahuff(78) /
T2 nahuff(86) /
T3 nahuff(91) /
T4 nahuff(87) /
/
33
XGRIND
• Many queries can be carried out entirely in the compressed domain:
  Exact-match, prefix-match
• Some others require decompression of the relevant values only:
  Range, substring
• Queryability comes at the expense of achievable compression ratio: typically within 65-75% of that of XMill
34
ISX Requirements
1. Space does matter for many applications
2. Generally reducing space improves cache locality
3. Indirection is expensive
4. Support fast navigations
5. Support fast insertion and deletion
6. Support efficient joins
7. Separate topology, text and schema
35
ISX Goal
To find a space-efficient storage scheme for XML data without compromising either query or update performance
38
Balanced Parenthesis Encoding
0 0 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 1 1
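A small sketch of how such a bit string arises and how a matching close is found. It assumes that 0 stands for "(" (node opened) and 1 for ")" (node closed), which is inferred from the bit string starting with 0s and ending with 1s; the nested-list tree shape below is likewise an assumption, chosen because it reproduces the 24 bits on the slide.

```python
def encode(node):
    """Balanced-parenthesis encoding of a tree given as a nested list of children."""
    bits = [0]                      # 0 = "(" : open this node
    for child in node:
        bits.extend(encode(child))
    bits.append(1)                  # 1 = ")" : close this node
    return bits

def find_close(bits, i):
    """Index of the close bit matching the open bit at index i (linear scan on excess)."""
    excess = 0
    for j in range(i, len(bits)):
        excess += 1 if bits[j] == 0 else -1
        if excess == 0:
            return j
    raise ValueError("unbalanced sequence")

# Assumed shape: a root with one child, which has five children with one leaf each.
tree = [[[[]], [[]], [[]], [[]], [[]]]]
bits = encode(tree)
print(" ".join(map(str, bits)))
print(find_close(bits, 2))
```

The scan in `find_close` is linear in the subtree size; precomputed summaries of the excess over blocks of bits allow whole blocks to be skipped instead of scanned.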
40
Topology Tiers
• No. of "("
• No. of ")"
• No. of text nodes
• Min, max of forward excess
• Min, max of backward excess
42
Topology Tiers
Excess 2
Where is the close tag?
46
Another example
Test machine: Core Duo 1.83GHz, 1GB RAM, 5400 RPM hard drive, MS Vista

5M DBLP              MSXML     ISX
Runtime (loading)    15MB      4MB
Loading time         0.54s     0.035s
Runtime (//www)      21MB      4MB
//www                0.096s    0.004s

100M DBLP            MSXML     ISX
Runtime (loading)    329MB     67MB
Loading time         17.8s     0.67s
Runtime (//www)      333MB     67MB
//www                1.814s    0.143s
48
Experiments
Setup
• Fixed at 64MB memory buffer
• Up to 16 GB XML document
• E.g. 16 GB DBLP contains > 770 million nodes
• NO index or query optimization has been employed for ISX (except for ISX Stream, where the TurboXPath algorithm has been employed)
57
XPath 13 axes
We can navigate along 13 axes:
• ancestor
• ancestor-or-self
• attribute
• child
• descendant
• descendant-or-self
• following
• following-sibling
• namespace
• parent
• preceding
• preceding-sibling
• self
61
ISX Summary
• Small storage footprint
• Small runtime footprint
• Fast and consistent performance on navigational access
• Superior query performance (further indexing / query optimization can be added)
• Superior update performance
Compressing and Searching XML Data Via Two Zips
Paolo Ferragina et al.
Slides modified from P. Ferragina’s
63
An XML excerpt
<dblp>
  <book>
    <author> Donald E. Knuth </author>
    <title> The TeXbook </title>
    <publisher> Addison-Wesley </publisher>
    <year> 1986 </year>
  </book>
  <article>
    <author> Donald E. Knuth </author>
    <author> Ronald W. Moore </author>
    <title> An Analysis of Alpha-Beta Pruning </title>
    <pages> 293-326 </pages>
    <year> 1975 </year>
    <volume> 6 </volume>
    <journal> Artificial Intelligence </journal>
  </article>
  ...
</dblp>
It is verbose!
64
A tree interpretation...
• XML document exploration corresponds to tree navigation
• XML document search corresponds to labeled subpath searches (a subset of XPath [W3C])
65
The Problem
• Summary indexes (like Dataguide, 1-index or 2-index): large space, and do not support "content" searches
• XML-aware compressors (like XMill, XmlPpm, ScmPpm, ...): need the whole decompression
• XML-queriable compressors (like XPress, XGrind, XQzip, ...): poor compression, and scan of the whole (compressed) file
• Theoretically, many solutions do exist, starting from [Jacobson, IEEE FOCS '89]: no subpath/content searches, and poor performance on labeled trees
We wish to devise a compressed representation for a labeled tree T that efficiently supports some operations:
• Navigational operations: parent(u), child(u, i), child(u, i, c)
• Subpath searches: given a sequence of k labels
• Content searches: subpath + substring search
• Visualization operation: given a node, visualize its descending subtree
XML-native search engines might exploit this tool as a core block for query optimization and (compressed) storage
66
A transform for "labeled trees" [Ferragina et al., IEEE FOCS '05]
We proposed the XBW-transform, which mimics on trees the nice structural properties of the Burrows-Wheeler Transform on strings
The XBW linearizes the tree T in 2 arrays s.t.:
• the compression of T reduces to using any compressor (gzip, bzip, ...) over these two arrays
• the indexing of T reduces to implementing simple rank/select query operations over these two arrays
67
The XBW-Transform
Step 1. Visit the tree in pre-order. For each node, write down its label and the labels on its upward path

Tree (children indented under their parent):
C
  B
    D
      c
      a
    c
  A
    b
    a
    D
      c
  B
    D
      b
    a

Sα (permutation of tree nodes)    Sπ (upward labeled paths)
C                                 (empty)
B                                 C
D                                 B C
c                                 D B C
a                                 D B C
c                                 B C
A                                 C
b                                 A C
a                                 A C
D                                 A C
c                                 D A C
B                                 C
D                                 B C
b                                 D B C
a                                 B C
68
The XBW-Transform
Step 2. Stably sort according to Sπ

Sα    Sπ (upward labeled paths)
C     (empty)
b     A C
a     A C
D     A C
D     B C
c     B C
D     B C
a     B C
B     C
A     C
B     C
c     D A C
c     D B C
a     D B C
b     D B C
69
The XBW-Transform
Step 3. Add a binary array Slast marking the rows corresponding to last children

XBW:
Slast   Sα    Sπ
1       C     (empty)
0       b     A C
0       a     A C
1       D     A C
0       D     B C
1       c     B C
0       D     B C
1       a     B C
0       B     C
0       A     C
1       B     C
1       c     D A C
0       c     D B C
1       a     D B C
1       b     D B C

Key fact: nodes correspond to items in <Slast, Sα>
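The three steps can be written down directly. The sketch below is a naive construction for illustration, not the authors' implementation: it assumes single-character labels, represents a node as a hypothetical `(label, children)` pair, and materializes every upward path as a string, which a real implementation would avoid. The tree literal is the example tree as it can be read back from the arrays on the slides.

```python
def xbw(tree):
    """Return (s_last, s_alpha, s_pi) for a tree of (label, [children]) pairs."""
    rows = []

    def visit(node, upward, is_last):
        label, children = node
        rows.append((upward, is_last, label))            # Step 1: pre-order visit
        for i, child in enumerate(children):
            visit(child, label + upward, i == len(children) - 1)

    visit(tree, "", True)
    rows.sort(key=lambda r: r[0])                        # Step 2: stable sort by upward path
    s_last = "".join("1" if last else "0" for _, last, _ in rows)   # Step 3
    s_alpha = "".join(label for _, _, label in rows)
    s_pi = [upward for upward, _, _ in rows]
    return s_last, s_alpha, s_pi

def leaf(label):
    return (label, [])

TREE = ("C", [
    ("B", [("D", [leaf("c"), leaf("a")]), leaf("c")]),
    ("A", [leaf("b"), leaf("a"), ("D", [leaf("c")])]),
    ("B", [("D", [leaf("b")]), leaf("a")]),
])

s_last, s_alpha, s_pi = xbw(TREE)
print(s_last)
print(s_alpha)
```

Python's `list.sort` is stable, so nodes with equal upward paths keep their pre-order position, as Step 2 requires.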
70
XBzip – a simple XML compressor
[Figure: the pcdata is stored separately from the tags, attributes and the symbol "="]
XBW is compressible:
• Sα and Spcdata are locally homogeneous
• Slast has some structure
71
Some structural properties
XBW for the example tree: Slast = 100101010011011, Sα = CbaDDcDaBABccab, with Sπ as sorted in Step 2
Two useful properties:
• Children are contiguous and delimited by 1s
• Children reflect the order of their parents
72
XBW is navigational
XBW for the example tree: Slast = 100101010011011, Sα = CbaDDcDaBABccab, with Sπ as sorted in Step 2
XBW is navigational:
• Rank-Select data structures on Slast and Sα
• The array C of |Σ| integers
Array C (first row whose Sπ starts with each label): A → 2, B → 5, C → 9, D → 12
Get_children (example: the second B):
• Rank(B, Sα) = 2
• Select in Slast the 2nd item 1 from here...
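Get_children can be sketched with naive rank and select over the slide's arrays. Rows are numbered from 1 as on the slide; the linear-time `rank`/`select` helpers stand in for the succinct rank-select structures, and treating "label not in C" as "leaf" is a shortcut that only works because leaves are lowercase in this example.

```python
S_LAST = "100101010011011"
S_ALPHA = "CbaDDcDaBABccab"
C = {"A": 2, "B": 5, "C": 9, "D": 12}   # first row whose upward path starts with the label

def rank(symbol, seq, i):
    """Number of occurrences of symbol in rows 1..i."""
    return seq[:i].count(symbol)

def select(symbol, seq, k):
    """Row of the k-th occurrence of symbol (0 when k == 0)."""
    pos = -1
    for _ in range(k):
        pos = seq.index(symbol, pos + 1)
    return pos + 1

def get_children(i):
    """Rows (first, last) holding the children of the node at row i, or None for a leaf."""
    label = S_ALPHA[i - 1]
    if label not in C:
        return None
    k = rank(label, S_ALPHA, i)                    # this is the k-th node with that label
    ones_before = rank("1", S_LAST, C[label] - 1)  # 1s above the label's block of rows
    first = select("1", S_LAST, ones_before + k - 1) + 1
    last = select("1", S_LAST, ones_before + k)
    return first, last

print(get_children(11))   # the second B, at row 11
```

For row 11, Rank(B, Sα) = 2, so the children are the second group delimited by 1s inside the block that starts at row C[B] = 5.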
73
XBW is searchable (count subpaths)
XBW-index for the example tree: Slast = 100101010011011, Sα = CbaDDcDaBABccab, with Sπ as sorted in Step 2; array C: A → 2, B → 5, C → 9, D → 12
Π = B D
Rows [fr, lr]: rows whose Sπ starts with 'B'
Inductive step:
• Pick the next char in Π[i+1], i.e. 'D'
• Search for the first and last 'D' in Sα[fr, lr]
• Jump to their children
Their children have upward path = 'D B'; they form the new range [fr, lr]
2 occurrences of Π because of two 1s
XBW is searchable:
• Rank-Select data structures on Slast and Sα
• Array C of |Σ| integers
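The inductive step can be traced in code on the slide's arrays. This is a sketch with naive linear-time rank and select (rows numbered from 1), and it assumes every label in the path is an internal-node label, i.e. a key of C; the helper names are hypothetical.

```python
S_LAST = "100101010011011"
S_ALPHA = "CbaDDcDaBABccab"
C = {"A": 2, "B": 5, "C": 9, "D": 12}   # first row whose upward path starts with the label

def rank(symbol, seq, i):
    return seq[:i].count(symbol)          # occurrences in rows 1..i

def select(symbol, seq, k):
    pos = -1
    for _ in range(k):
        pos = seq.index(symbol, pos + 1)
    return pos + 1                        # row of the k-th occurrence, 0 when k == 0

def block_end(label):
    """Last row whose upward path starts with label."""
    later = [row for row in C.values() if row > C[label]]
    return min(later) - 1 if later else len(S_ALPHA)

def children_range(label, k):
    """Rows holding the children of the k-th node labeled `label`."""
    ones_before = rank("1", S_LAST, C[label] - 1)
    return (select("1", S_LAST, ones_before + k - 1) + 1,
            select("1", S_LAST, ones_before + k))

def subpath_search(path):
    """Return (fr, lr, occurrences) for a top-down path of internal labels, e.g. 'BD'."""
    fr, lr = C[path[0]], block_end(path[0])
    for c in path[1:]:
        k1 = rank(c, S_ALPHA, fr - 1) + 1     # first c in rows fr..lr
        k2 = rank(c, S_ALPHA, lr)             # last c in rows fr..lr
        if k1 > k2:
            return None                       # the path does not occur
        fr = children_range(c, k1)[0]
        lr = children_range(c, k2)[1]
    occurrences = rank("1", S_LAST, lr) - rank("1", S_LAST, fr - 1)
    return fr, lr, occurrences

print(subpath_search("BD"))
```

For Π = B D the search starts from rows 5 to 8, finds the D's at rows 5 and 7, and jumps to rows 13 to 15, which contain two 1s and hence two occurrences.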