
Lecture 8: XML Compression

COMP9319

2

Semistructured Data / XML

Semistructured => loosely structured
• no restrictions on tags & nesting relationships
• no schema required

XML
• under the “semistructured” umbrella
• self-describing
• the standard for information representation & exchange

3

XML data file can be modeled in a tree form

<Staff>
  <Name>
    <FirstName> Raymond </FirstName>
    <LastName> Wong </LastName>
  </Name>
  <Login> wong </Login>
  <Ext> 5932 </Ext>
</Staff>

Staff
  Name
    FirstName: “Raymond”
    LastName: “Wong”
  Login: “wong”
  Ext: “5932”

4

XPath evaluation

<a><b><c>12</c><d>7</d></b><b><c>7</c></b></a>

a
  b
    c: 12
    d: 7
  b
    c: 7

/a/b[c = "12"]

5

Query evaluation

Top-down

Bottom-up

Hybrid

6


<a><b><c>12</c><d>7</d></b><b><c>7</c></b></a>

/a/b[c = "12"]

7

Result of /a/b[c = "12"]: <b><c>12</c><d>7</d></b>
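The query above can be tried with Python's standard library. This is only a sketch: it assumes the document has the two b elements shown in the tree, and it uses the limited XPath subset of xml.etree.ElementTree rather than a full XPath engine.

```python
import xml.etree.ElementTree as ET

# Example document; the two <b> elements follow the tree drawn on the slide.
doc = "<a><b><c>12</c><d>7</d></b><b><c>7</c></b></a>"
root = ET.fromstring(doc)  # root is the <a> element

# /a/b[c = "12"]: the b children of a that have a child c with text "12".
matches = root.findall("./b[c='12']")

result = [ET.tostring(b, encoding="unicode") for b in matches]
print(result)
```

Only the first b qualifies, because the second b has c = 7.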

8

Path indexing

Traversing the graph/tree is almost the same as query processing for semistructured / XML data

Normally, a query requires traversing the data from the root and returning all nodes X reachable by a path matching the given regular path expression

Motivation: an index allows the system to answer regular path expressions without traversing the whole graph/tree

9

Major Criteria for indexing

Speed up the search (by cutting the search space down)

Relatively smaller size than the original data graph/tree

Easy to maintain (during data loading and during updates)

10

An Example of DAG Data

[Figure: a DAG with a root and nodes o1 to o13; edge labels: staff, dept, support, member, name, phone]

11

Index graph based on language-equivalence

A reduced graph that summarizes all paths from the root in the data graph

The paths from root to o12:
• staff
• dept/member
• support/member

12

Language-equivalent nodes

Let L(x) := {w | there is a path from the root to x labeled w}

The set L(x) may be infinite when there are cycles

Nodes x, y are language-equivalent (x ≡ y) if L(x) = L(y)

We construct index I by taking the nodes to be the equivalence classes for ≡
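As a rough illustration of the definition, the sketch below computes L(x) for every node of a small acyclic graph and groups nodes with equal L(x). The toy graph and its node names (d1, s1, n1, ...) are made up for this example, and the code only works for DAGs, where L(x) is finite.

```python
from collections import defaultdict
from functools import lru_cache

# Hypothetical toy DAG: (source, edge label, target).
EDGES = [
    ("root", "dept", "d1"),
    ("root", "dept", "d2"),
    ("root", "staff", "s1"),
    ("root", "staff", "s2"),
    ("d1", "member", "s1"),
    ("d2", "member", "s2"),
    ("s1", "name", "n1"),
    ("s2", "name", "n2"),
]

incoming = defaultdict(list)
for src, label, dst in EDGES:
    incoming[dst].append((src, label))

@lru_cache(maxsize=None)
def language(node):
    """L(node): all label sequences on paths from the root to node."""
    if node == "root":
        return frozenset({()})
    return frozenset(
        path + (label,)
        for src, label in incoming[node]
        for path in language(src)
    )

def equivalence_classes(nodes):
    """Group nodes whose path languages are identical."""
    classes = defaultdict(list)
    for n in nodes:
        classes[language(n)].append(n)
    return sorted(sorted(c) for c in classes.values())

nodes = {"root"} | {dst for _, _, dst in EDGES}
classes = equivalence_classes(nodes)
print(classes)
```

Here s1 and s2 end up in one class because both are reached by exactly the paths staff and dept/member.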

13

Language-equivalent

The paths from root to o3:
• staff
• dept/member

Paths to o4 happen to be exactly the same 2 sequences

Same for o8 and o12

o3 ≡ o4 ≡ o8 ≡ o12

14

Equivalence classes

[Figure: the DAG data graph with language-equivalent nodes grouped together]

{o3, o4, o8, o12}
{o1, o2, o7}
{o12, o13}
{o5, o6, o9}
{o10}
{o11}

15

The index graph

[Figure: index graph with nodes root, {o1, o2, o7}, {o3, o4, o8, o12}, {o12, o13}, {o5, o6, o9}, {o10}, {o11}; edge labels: staff, dept, support, member, name, phone]

16

Query processing based on the index graph

[Figure: the index graph from the previous slide]

dept/member/(name | phone)

-> dept/member/name UNION dept/member/phone

-> {o5, o6, o9} UNION {o10}

-> {o5, o6, o9, o10}
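The evaluation above can be sketched in a few lines. The extents are the ones listed on the slides, but the index-graph edges below are an assumption pieced together from the slide text (the figure itself is not recoverable), and the names I1 to I6 are hypothetical.

```python
# Index graph: node -> list of (edge label, child node).
# ASSUMPTION: these edges are reconstructed from the slide text.
INDEX = {
    "root": [("dept", "I1"), ("staff", "I2"), ("support", "I3")],
    "I1": [("member", "I2")],
    "I2": [("name", "I5"), ("phone", "I6")],
    "I3": [("member", "I4")],
}

# Extent of each index node: the data nodes it stands for.
EXTENT = {
    "I1": {"o1", "o2", "o7"},
    "I2": {"o3", "o4", "o8", "o12"},
    "I3": {"o11"},
    "I4": {"o12", "o13"},
    "I5": {"o5", "o6", "o9"},
    "I6": {"o10"},
}

def evaluate(steps):
    """steps: one set of allowed labels per path step, e.g. name | phone."""
    frontier = {"root"}
    for labels in steps:
        frontier = {
            child
            for node in frontier
            for label, child in INDEX.get(node, [])
            if label in labels
        }
    result = set()
    for node in frontier:
        result |= EXTENT[node]   # union of the extents reached
    return result

# dept/member/(name | phone)
answer = evaluate([{"dept"}, {"member"}, {"name", "phone"}])
print(sorted(answer))
```

The traversal touches only index nodes; the data graph is never visited.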

17

About this indexing scheme

The index graph is never larger than the data

In practice, the index graph is small enough to fit in memory

Constructing the index is, however, a problem
• checking whether two nodes are language-equivalent is very expensive (PSPACE)
• an approximation based on bisimulation exists
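To make the bisimulation remark concrete, here is a naive partition-refinement sketch for backward bisimulation: two nodes stay in the same block only if their incoming edges carry the same labels from matching blocks. The function name and the toy graph are hypothetical, and practical implementations use a faster refinement algorithm. Bisimilar nodes are always language-equivalent, so the result is a safe (possibly finer) approximation.

```python
from collections import defaultdict

def backward_bisimulation(nodes, edges, root):
    """Return {node: block id}; naive refinement until the partition is stable."""
    incoming = defaultdict(list)
    for src, label, dst in edges:
        incoming[dst].append((label, src))

    block = {n: (n == root) for n in nodes}   # the root starts in its own block
    while True:
        # A node's signature: its current block plus (label, block) of each parent.
        signature = {
            n: (block[n], frozenset((lab, block[src]) for lab, src in incoming[n]))
            for n in nodes
        }
        ids = {}
        new_block = {n: ids.setdefault(signature[n], len(ids)) for n in nodes}
        # Blocks are only ever split, so an unchanged count means stability.
        if len(set(new_block.values())) == len(set(block.values())):
            return new_block
        block = new_block

# Hypothetical toy DAG: (source, edge label, target).
EDGES = [
    ("root", "dept", "d1"), ("root", "dept", "d2"),
    ("root", "staff", "s1"), ("root", "staff", "s2"),
    ("d1", "member", "s1"), ("d2", "member", "s2"),
    ("s1", "name", "n1"), ("s2", "name", "n2"),
]
NODES = ["root", "d1", "d2", "s1", "s2", "n1", "n2"]

blocks = backward_bisimulation(NODES, EDGES, "root")
groups = defaultdict(set)
for node, b in blocks.items():
    groups[b].add(node)
partition = sorted(sorted(g) for g in groups.values())
print(partition)
```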

18

A Data Guide

[Figure: data guide rooted at root; node extents {o1, o2, o7}, {o3, o4, o8, o12}, {o11}, {o12, o13}, {o5, o6, o9}, {o10}; edge labels: dept, staff, support, member, name, phone; some extents appear under more than one node]

19

About Data Guide

• unique labels at each node
• (hence) extents are no longer disjoint
• query processing proceeds as before
• size of the index may be >= data size
• good for data that is regular & has no cycles

20

XML-Specific Compressors

Unqueriable Compression (e.g. XMill):
• Full-chunked: data commonalities eliminated
• Very good compression ratio

Queriable Compression (e.g. XGrind, XPRESS):
• Fine-grained: data commonalities ignored
• Inadequate compression ratio and time
• Support simple path queries with atomic predicate

21

XMill

First specialized compressor for XML data
• SAX parser for parsing XML data (http://www.megginson.com/downloads/SAX/)
• Still using gzip as its underlying compressor
• Clever grouping of data into containers for compression

Compress XML via three basic techniques
• Compress the structure separately from the data
• Group the data values according to their types
• Apply semantic (specialized) compressors

Downloadable: http://www.cs.washington.edu/homes/suciu/XMILL

22

XMill Architecture:

23

An Example: Web Server Logs

ASCII File 15.9 MB (gzipped 1.6MB):

202.239.238.16|GET / HTTP/1.0|text/html|200|1997/10/01-00:00:02|-|4478|-|-|http://www.net.jp/|Mozilla/3.1[ja](I)

XML-ized apache web log inflates to 24.2 MB (gzipped 2.1MB):

<apache:entry>
  <apache:host> 202.239.238.16 </apache:host>
  <apache:requestLine> GET / HTTP/1.0 </apache:requestLine>
  <apache:contentType> text/html </apache:contentType>
  <apache:statusCode> 200 </apache:statusCode>
  <apache:date> 1997/10/01-00:00:02 </apache:date>
  <apache:byteCount> 4478 </apache:byteCount>
  <apache:referer> http://www.net.jp/ </apache:referer>
  <apache:userAgent> Mozilla/3.1[ja](I) </apache:userAgent>
</apache:entry>

http://httpd.apache.org/docs/logs.html

24

How XMill Works: Three Ideas

Compress the structure separately from the data:

Structure: <apache:entry> <apache:host> </apache:host> . . . </apache:entry>

Data: 202.239.238.16 GET / HTTP/1.0 text/html 200 …

gzip Structure + gzip Data = 1.75MB

25


Group the data values according to their types:

Structure: <apache:entry> . . . </apache:entry>

Data1: 202.23.23.16 224.42.24.55 …

Data2: GET / HTTP/1.0 GET / HTTP/1.1 …

gzip Structure + gzip Data1 + gzip Data2 + ... = 1.33MB
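The first two ideas can be sketched as follows. This is not XMill's actual container format: the sample document, the one-container-per-tag rule and the variable names are assumptions for illustration, zlib stands in for gzip (both use DEFLATE), and the structure stream omits the data placeholders a real decompressor would need.

```python
import zlib
import xml.etree.ElementTree as ET
from collections import defaultdict

# Two hypothetical log entries (namespace prefix dropped for brevity).
doc = """<log>
<entry><host>202.239.238.16</host><requestLine>GET / HTTP/1.0</requestLine></entry>
<entry><host>224.42.24.55</host><requestLine>GET / HTTP/1.1</requestLine></entry>
</log>"""

structure = []                  # tag tokens only, no data values
containers = defaultdict(list)  # one container of values per tag name

def split(elem):
    """Send tags to the structure stream and text to the tag's container."""
    structure.append("<" + elem.tag + ">")
    text = (elem.text or "").strip()
    if text:
        containers[elem.tag].append(text)
    for child in elem:
        split(child)
    structure.append("</" + elem.tag + ">")

split(ET.fromstring(doc))

# Compress the structure and every container independently.
compressed = {"structure": zlib.compress(" ".join(structure).encode())}
for tag, values in containers.items():
    compressed[tag] = zlib.compress("\n".join(values).encode())

for name, blob in compressed.items():
    print(name, len(blob), "bytes")
```

Grouping similar values (all hosts together, all request lines together) is what gives the underlying compressor longer repeated patterns to exploit.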

26


Apply semantic (specialized) compressors:

gzip Structure + gzip c1(Data1) + gzip c2(Data2) + ... = 0.82MB

Examples:
• 8, 16, 32-bit integer encoding (signed/unsigned)
• differential compressing (e.g. 1999, 1995, 2001, 2000, 1995, ...)
• compress lists, records (e.g. 104.32.23.1 => 4 bytes)

Need user input to select the semantic compressor
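Two of the examples above are easy to try out. The helper names below are made up for this sketch and are not XMill's own compressors; the IP packing uses the standard socket.inet_aton call.

```python
import socket
import struct

def delta_encode(values):
    """Keep the first value, then store each difference from its predecessor."""
    return values[:1] + [b - a for a, b in zip(values, values[1:])]

def delta_decode(deltas):
    """Invert delta_encode by accumulating the differences."""
    out = []
    for d in deltas:
        out.append(d if not out else out[-1] + d)
    return out

years = [1999, 1995, 2001, 2000, 1995]
deltas = delta_encode(years)
print(deltas)                      # small numbers compress better than full years

# A dotted IPv4 address takes 11 characters as text but only 4 bytes packed.
ip_text = "104.32.23.1"
ip_bytes = socket.inet_aton(ip_text)
print(len(ip_text), len(ip_bytes))

# Fixed-width integer encoding: an unsigned 16-bit year.
packed_year = struct.pack("<H", 1999)
```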

27

Experiments

28

XML Compression

29

Compression Time

30

Transfer Time (& Decode)

31

XGRIND (Tolani & Haritsa, 2002)

Encodes elements and attributes using XMill’s approach

DTD-conscious: enumerated attributes with k possible values are encoded using a log2 k-bit scheme

Data values are encoded using non-adaptive Huffman coding
• Requires two passes over the input document
• Separate statistical model for each element/attribute

Homomorphic compression: compressed document retains original structure

June 24, 2008 XML Compression Techniques 31

32

XGRIND

Original Fragment:

<student name="Alice">
  <a1>78</a1>
  <a2>86</a2>
  <midterm>91</midterm>
  <project>87</project>
</student>

Compressed Fragment:

T0 A0 nahuff(Alice) T1 nahuff(78) / T2 nahuff(86) / T3 nahuff(91) / T4 nahuff(87) / /
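The token layout of the compressed fragment can be reproduced with a short sketch. It assumes tag and attribute ids are handed out in order of first appearance (XGrind would derive them from the document or DTD), and nahuff(...) is kept as a textual placeholder instead of running a real non-adaptive Huffman coder.

```python
import xml.etree.ElementTree as ET

def encode(fragment):
    """Emit a homomorphic token stream: Tn for tags, An for attributes, / for end tags."""
    tag_ids, attr_ids, out = {}, {}, []

    def visit(elem):
        out.append("T%d" % tag_ids.setdefault(elem.tag, len(tag_ids)))
        for name, value in elem.attrib.items():
            out.append("A%d" % attr_ids.setdefault(name, len(attr_ids)))
            out.append("nahuff(%s)" % value)
        text = (elem.text or "").strip()
        if text:
            out.append("nahuff(%s)" % text)
        for child in elem:
            visit(child)
        out.append("/")  # one generic end-tag marker closes any element

    visit(ET.fromstring(fragment))
    return " ".join(out)

fragment = (
    '<student name="Alice"><a1>78</a1><a2>86</a2>'
    "<midterm>91</midterm><project>87</project></student>"
)
tokens = encode(fragment)
print(tokens)
```

Because the output keeps the nesting of the input, a path query can be matched against the Tn tokens without decompressing the values.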

33

XGRIND

Many queries can be carried out entirely in compressed domain
• Exact-match, prefix-match

Some others require only decompression of relevant values
• Range, substring

Queryability comes at the expense of achievable compression ratio: typically within 65-75% that of XMill


34

ISX Requirements

1. Space does matter for many applications
2. Generally reducing space improves cache locality
3. Indirection is expensive
4. Support fast navigations
5. Support fast insertion and deletion
6. Support efficient joins
7. Separate topology, text and schema

35

ISX Goal

To find a space-efficient storage scheme for XML data without compromising either query or update performance

36

Proposed Storage Structure

The ISX Structure

37

Sample DBLP XML Fragment

38

Balanced Parenthesis Encoding

0 0 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 1 1
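A small sketch of how such a bit string arises. It assumes 0 stands for an opening parenthesis and 1 for a closing one (which fits the sequence above), and the tree labels are hypothetical since the DBLP fragment itself is not shown; labels are not stored in the topology anyway.

```python
def encode(tree):
    """tree = (label, [children]); returns the topology bits in document order."""
    label, children = tree
    bits = [0]                      # 0 = open parenthesis (assumed convention)
    for child in children:
        bits.extend(encode(child))
    bits.append(1)                  # 1 = close parenthesis
    return bits

def find_close(bits, i):
    """Position of the close matching the open at i (plain linear scan)."""
    excess = 0
    for j in range(i, len(bits)):
        excess += 1 if bits[j] == 0 else -1
        if excess == 0:
            return j
    raise ValueError("unbalanced sequence")

def first_child(bits, i):
    """The first child starts right after the open, unless the node is a leaf."""
    return i + 1 if bits[i + 1] == 0 else None

def next_sibling(bits, i):
    """The next sibling starts right after this node's close, if an open follows."""
    j = find_close(bits, i) + 1
    return j if j < len(bits) and bits[j] == 0 else None

def field(name):
    return (name, [("#text", [])])   # an element holding one text node

# Hypothetical fragment: one record with five fields, each holding text.
tree = ("dblp", [("article", [field("author"), field("title"),
                              field("pages"), field("year"), field("journal")])])
bits = encode(tree)
print(" ".join(map(str, bits)))
```

Twelve nodes cost 24 bits here, which is where the small topology footprint comes from.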

39

Node Navigations

40

Topology Tiers

• No. of "(", no. of ")"
• No. of text nodes
• Min, max of forward excess
• Min, max of backward excess

41

Primitive operators

42

Topology Tiers

• No. of "(", no. of ")"
• No. of text nodes
• Min, max of forward excess
• Min, max of backward excess

Excess 2

Where is the close tag?
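One way to read the "where is the close tag?" question: per-block excess summaries let a search skip whole blocks that cannot contain the matching close. The sketch below is a simplified single-tier version under assumptions of my own (block size 8, only the total and minimum forward excess are stored, 0 = open and 1 = close); it is not the actual ISX layout.

```python
BLOCK = 8  # hypothetical block size; a real tier would use much larger blocks

def step(bit):
    return 1 if bit == 0 else -1    # assumed convention: 0 opens, 1 closes

def build_tier(bits):
    """Per block: (total excess change, minimum prefix excess)."""
    tier = []
    for start in range(0, len(bits), BLOCK):
        excess, low = 0, 0
        for bit in bits[start:start + BLOCK]:
            excess += step(bit)
            low = min(low, excess)
        tier.append((excess, low))
    return tier

def find_close(bits, tier, i):
    """Position of the close matching the open at i, skipping hopeless blocks."""
    first_block = i // BLOCK
    excess = 0
    # 1. Scan the remainder of the block that contains i.
    for j in range(i, min(len(bits), (first_block + 1) * BLOCK)):
        excess += step(bits[j])
        if excess == 0:
            return j
    # 2. Skip blocks whose minimum excess cannot bring the total down to 0.
    block = first_block + 1
    while block < len(tier) and excess + tier[block][1] > 0:
        excess += tier[block][0]
        block += 1
    # 3. Scan inside the block that must contain the close.
    for j in range(block * BLOCK, min(len(bits), (block + 1) * BLOCK)):
        excess += step(bits[j])
        if excess == 0:
            return j
    raise ValueError("unbalanced sequence")

bits = [0,0,0,0,1,1,0,0,1,1,0,0,1,1,0,0,1,1,0,0,1,1,1,1]
tier = build_tier(bits)
print(find_close(bits, tier, 0), find_close(bits, tier, 6))
```

For the root (position 0) the middle block is skipped entirely, because its minimum excess shows the close cannot be there.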

43

Tier 2 excess

44

Efficient Updates

45

Example

100 MB DBLP document

5 million XML nodes

ISX: 1MB topology

46

Another example

Test machine:
• Core Duo 1.83GHz
• 1GB RAM
• 5400 RPM hard drive
• MS Vista

5M DBLP              MSXML     ISX
Runtime (loading)    15MB      4MB
Loading time         0.54s     0.035s
Runtime (//www)      21MB      4MB
//www                0.096s    0.004s

100M DBLP            MSXML     ISX
Runtime (loading)    329MB     67MB
Loading time         17.8s     0.67s
Runtime (//www)      333MB     67MB
//www                1.814s    0.143s

47

ISX Features

48

Experiments

Setup
• Fixed at 64MB memory buffer
• Up to 16 GB XML document
• E.g. 16 GB DBLP contains > 770 million nodes
• NO index or query optimization has been employed for ISX (except for ISX Stream, where the TurboXPath algorithm has been employed)

49

Storage Size (ISX vs NoK)

50

Storage Size (ISX, XMill, XGrind): DBLP

51

Storage Size (ISX, XMill): TreeBank

52

Bulk Loading Performance

53

Queries

54

Q1: //inproceedings

55

Q5: //article[.//month/text() = "July"]//title

56

Other queries

57

XPath 13 axes

We can navigate along 13 axes:
• ancestor
• ancestor-or-self
• attribute
• child
• descendant
• descendant-or-self
• following
• following-sibling
• namespace
• parent
• preceding
• preceding-sibling
• self

58

Node Navigation

59

Full document traversal

60

Update (Insertion) Performance

61

ISX Summary

• Small storage footprint
• Small runtime footprint
• Fast and consistent performance on navigational access
• Superior query performance (further indexing / query optimization can be added)
• Superior update performance

Compressing and Searching XML Data Via Two Zips

Paolo Ferragina et al.

Slides modified from P. Ferragina’s

63

An XML excerpt

<dblp>
  <book>
    <author> Donald E. Knuth </author>
    <title> The TeXbook </title>
    <publisher> Addison-Wesley </publisher>
    <year> 1986 </year>
  </book>
  <article>
    <author> Donald E. Knuth </author>
    <author> Ronald W. Moore </author>
    <title> An Analysis of Alpha-Beta Pruning </title>
    <pages> 293-326 </pages>
    <year> 1975 </year>
    <volume> 6 </volume>
    <journal> Artificial Intelligence </journal>
  </article>
  ...
</dblp>

It is verbose!

64

A tree interpretation...

XML document exploration => Tree navigation

XML document search => Labeled subpath searches

Subset of XPath [W3C]

65

The Problem

Summary indexes (like Dataguide, 1-index or 2-index): large space and do not support “content” searches

XML-aware compressors (like XMill, XmlPpm, ScmPpm,...): need the whole decompression

XML-queriable compressors (like XPress, XGrind, XQzip,...): poor compression and scan of the whole (compressed) file

We wish to devise a compressed representation for a labeled tree T that efficiently supports some operations:
• Navigational operations: parent(u), child(u, i), child(u, i, c)
• Subpath searches: given a sequence of k labels
• Content searches: subpath + substring search
• Visualization operation: given a node, visualize its descending subtree

XML-native search engines might exploit this tool as a core block for query optimization and (compressed) storage

Theoretically, many solutions exist, starting from [Jacobson, IEEE Focs ’89]: no subpath/content searches, and poor performance on labeled trees

66

A transform for “labeled trees” [Ferragina et al, IEEE Focs ’05]

We proposed the XBW-transform, which mimics on trees the nice structural properties of the Burrows-Wheeler Transform on strings

The XBW linearizes the tree T in 2 arrays s.t.:

• the compression of T reduces to using any compressor (gzip, bzip,...) over these two arrays

• the indexing of T reduces to implementing simple rank/select query operations over these two arrays

67

The XBW-Transform

Example tree (children listed left to right):

C
  B
    D
      c
      a
    c
  A
    b
    a
    D
      c
  B
    D
      b
    a

Step 1. Visit the tree in pre-order. For each node, write down its label and the labels on its upward path

Sα (node labels, a permutation of tree nodes): C B D c a c A b a D c B D b a

Sπ (upward labeled paths): ε, C, B C, D B C, D B C, B C, C, A C, A C, A C, D A C, C, B C, D B C, B C

68


Step 2. Stably sort according to Sπ

Sα: C b a D D c D a B A B c c a b

Sπ (upward labeled paths): ε, A C, A C, A C, B C, B C, B C, B C, C, C, C, D A C, D B C, D B C, D B C

69

Step 3. Add a binary array Slast marking the rows corresponding to last children

Slast: 1 0 0 1 0 1 0 1 0 0 1 1 0 1 1
Sα:    C b a D D c D a B A B c c a b

XBW = <Slast, Sα>

Key fact: nodes correspond to items in <Slast, Sα>
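The three steps can be followed in code. This sketch materializes the upward paths explicitly, which is fine for a toy tree but is not how an efficient construction would work; it also assumes single-character labels so that a path can be kept as a plain string. The tree is the slide's example as read from the arrays above.

```python
def xbw(tree):
    """tree = (label, [children]); returns (s_last, s_alpha, s_pi) after sorting."""
    rows = []

    def visit(node, path, is_last):
        label, children = node
        rows.append((path, int(is_last), label))        # Step 1: pre-order visit
        for k, child in enumerate(children):
            # The child's upward path is the parent's label plus the parent's path.
            visit(child, label + path, k == len(children) - 1)

    visit(tree, "", True)
    rows.sort(key=lambda row: row[0])                   # Step 2: stable sort by path
    s_last = "".join(str(row[1]) for row in rows)       # Step 3: last-child flags
    s_alpha = "".join(row[2] for row in rows)
    s_pi = [row[0] for row in rows]
    return s_last, s_alpha, s_pi

def leaf(label):
    return (label, [])

TREE = ("C", [
    ("B", [("D", [leaf("c"), leaf("a")]), leaf("c")]),
    ("A", [leaf("b"), leaf("a"), ("D", [leaf("c")])]),
    ("B", [("D", [leaf("b")]), leaf("a")]),
])

s_last, s_alpha, s_pi = xbw(TREE)
print(s_last)
print(s_alpha)
```

Python's sort is stable, so nodes with equal upward paths keep their pre-order (document) order, as Step 2 requires.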

70

XBzip – a simple XML compressor

[Figure: the XBW arrays split into a Pcdata part and a part holding tags, attributes and the symbol =]

XBW is compressible:

• Sα and Spcdata are locally homogeneous

• Slast has some structure

71

Some structural properties

Slast: 1 0 0 1 0 1 0 1 0 0 1 1 0 1 1
Sα:    C b a D D c D a B A B c c a b

Two useful properties:

• Children are contiguous and delimited by 1s

• Children reflect the order of their parents

72

XBW is navigational

Slast: 1 0 0 1 0 1 0 1 0 0 1 1 0 1 1
Sα:    C b a D D c D a B A B c c a b

XBW is navigational:

• Rank-Select data structures on Slast and Sα

• The array C of |Σ| integers: A 2, B 5, C 9, D 12

Get_children (for the second B in Sα):

• Rank(B, Sα) = 2

• Select in Slast the 2nd item 1 from here (row C[B] = 5)...
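Get_children can be sketched directly on the two arrays. The rank and select helpers below are plain linear scans standing in for real succinct rank/select structures, positions are 1-based as on the slides, and the code relies on internal-node labels (upper case) being distinct from leaf labels (lower case), as in this example.

```python
S_LAST = "100101010011011"
S_ALPHA = "CbaDDcDaBABccab"
# First row (1-based) whose upward path starts with each internal label.
C = {"A": 2, "B": 5, "C": 9, "D": 12}

def rank(symbol, seq, i):
    """Occurrences of symbol in seq[1..i] (1-based, inclusive)."""
    return seq[:i].count(symbol)

def select(symbol, seq, k):
    """1-based position of the k-th occurrence of symbol in seq (0 if k == 0)."""
    pos = -1
    for _ in range(k):
        pos = seq.index(symbol, pos + 1)
    return pos + 1

def get_children(i):
    """Rows (first, last) holding the children of the node at row i, or None."""
    label = S_ALPHA[i - 1]
    if label not in C:                       # leaves have no children block
        return None
    k = rank(label, S_ALPHA, i)              # this is the k-th node with that label
    ones_before = rank("1", S_LAST, C[label] - 1)
    first = select("1", S_LAST, ones_before + k - 1) + 1
    last = select("1", S_LAST, ones_before + k)
    return first, last

first, last = get_children(11)               # row 11 is the second B
print(first, last, S_ALPHA[first - 1:last])
```

For the second B this gives rows 7 to 8, i.e. the children D and a, matching the tree.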

73

XBW is searchable (count subpaths)

Slast: 1 0 0 1 0 1 0 1 0 0 1 1 0 1 1
Sα:    C b a D D c D a B A B c c a b

XBW is searchable:

• Rank-Select data structures on Slast and Sα

• Array C of |Σ| integers: A 2, B 5, C 9, D 12

Example: Π = B D

Rows [fr, lr] whose Sπ starts with ‘B’

Inductive step:

• Pick the next char in Π[i+1], i.e. ‘D’

• Search for the first and last ‘D’ in Sα[fr, lr]

• Jump to their children

Their children have upward path = ‘D B’

2 occurrences of Π because of two 1s
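The subpath count can be sketched with the same two arrays. As before, rank and select are linear-scan stand-ins for succinct structures, positions are 1-based, and the function only handles paths made of internal-node labels; count_subpath and block_of are hypothetical names for this illustration.

```python
S_LAST = "100101010011011"
S_ALPHA = "CbaDDcDaBABccab"
# First row (1-based) whose upward path starts with each internal label.
C = {"A": 2, "B": 5, "C": 9, "D": 12}

def rank(symbol, seq, i):
    """Occurrences of symbol in seq[1..i] (1-based, inclusive)."""
    return seq[:i].count(symbol)

def select(symbol, seq, k):
    """1-based position of the k-th occurrence of symbol in seq (0 if k == 0)."""
    pos = -1
    for _ in range(k):
        pos = seq.index(symbol, pos + 1)
    return pos + 1

def block_of(label, k):
    """Rows holding the children of the k-th node labelled `label`."""
    ones_before = rank("1", S_LAST, C[label] - 1)
    first = select("1", S_LAST, ones_before + k - 1) + 1
    last = select("1", S_LAST, ones_before + k)
    return first, last

def count_subpath(labels):
    """Count occurrences of a top-down path of internal labels, e.g. ['B', 'D']."""
    head = labels[0]
    fr, _ = block_of(head, 1)                      # rows whose path starts with head
    _, lr = block_of(head, S_ALPHA.count(head))
    for c in labels[1:]:
        k1 = rank(c, S_ALPHA, fr - 1) + 1          # first c inside [fr, lr]
        k2 = rank(c, S_ALPHA, lr)                  # last c inside [fr, lr]
        if k1 > k2:
            return 0
        fr, _ = block_of(c, k1)                    # jump to their children
        _, lr = block_of(c, k2)
    # Every 1 in [fr, lr] closes the children block of one matching node.
    return rank("1", S_LAST, lr) - rank("1", S_LAST, fr - 1)

print(count_subpath(["B", "D"]))
```

For B D the final range is rows 13 to 15, which contains two 1s, hence two occurrences.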
