Validating Streaming XML Documents Luc Segoufin & Victor Vianu Presented by Harel Paz


Post on 19-Jan-2016


Page 1: Validating Streaming XML Documents Luc Segoufin & Victor Vianu

Validating Streaming XML Documents

Luc Segoufin & Victor Vianu

Presented by Harel Paz

Page 2: Validating Streaming XML Documents Luc Segoufin & Victor Vianu

The Challenge

XML is becoming a standard for data exchange on the Web.

Need: on-line processing of large amounts of data in XML format, using limited memory.

Our focus: validating XML documents against given DTDs.

Page 3: Validating Streaming XML Documents Luc Segoufin & Victor Vianu

Validating Streaming XML Documents

Restrictions on the validation:
In a single pass.
Using a fixed amount of memory, depending on the DTD.

[Figure: an input stream ...<u><v>...</v><v><w>...</w></v> is read by an FSA (with start and accept states), which answers Yes/No.]

Page 4: Validating Streaming XML Documents Luc Segoufin & Victor Vianu

The Problem in 2 Flavors

There are 2 flavors to the problem:

Strong validation: validation that includes checking well-formedness.

Validation: checking satisfaction of the DTD, under the assumption that the input is a well-formed XML document.

Page 5: Validating Streaming XML Documents Luc Segoufin & Victor Vianu

Tree Document

XML documents are abstracted by “tree documents”.

A tree document over a finite alphabet $\Sigma$ is a finite unranked tree with labels in $\Sigma$ and an order on the children of each node.

[Figure: a tree document $t$ with root r, two children labeled a, and descendants labeled b and c.]

Page 6: Validating Streaming XML Documents Luc Segoufin & Victor Vianu

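String Representation

XML documents are a string representation of trees using opening and closing tags for each element.

For each $a \in \Sigma$, $a$ represents the opening tag and $\bar{a}$ represents the closing tag for $a$. Notation: $\bar{\Sigma} = \{\bar{a} \mid a \in \Sigma\}$.

[Figure: the tree document of the previous slide next to its string representation over $\Sigma \cup \bar{\Sigma}$.]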
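As a minimal sketch of this encoding, the following Python function serializes a tree into its string representation. The tuple encoding of trees, the name `to_string`, and the use of `/a` to stand for the closing tag $\bar{a}$ are assumptions made for illustration; the sample tree is also made up and is not the one in the figure.

```python
from typing import List, Tuple

# A tree document is modeled as (label, [children]). This encoding is an
# illustrative assumption, not notation taken from the slides.
Tree = Tuple[str, list]

def to_string(tree: Tree) -> List[str]:
    """Return the string representation: opening tag, children, closing tag."""
    label, children = tree
    tokens = [label]                      # opening tag a
    for child in children:
        tokens.extend(to_string(child))   # children in document order
    tokens.append("/" + label)            # closing tag, written /a for a-bar
    return tokens

# Hypothetical sample tree: r with two a children, the first having b and c.
example = ("r", [("a", [("b", []), ("c", [])]), ("a", [])])
print(" ".join(to_string(example)))
```

The output is a flat token sequence, which is the only view of the document that a streaming validator gets to see.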

Page 7: Validating Streaming XML Documents Luc Segoufin & Victor Vianu

DTDs

A DTD consists of an extended context-free grammar over alphabet Σ.

DTD $d$:
r → a*
a → bc
b → c?
c → ε

A tree document over Σ satisfies a DTD $d$ if it is a derivation tree of the grammar.

[Figure: a tree document T that satisfies $d$.]
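The satisfaction condition can be sketched as a check on an in-memory tree: at every node, the word formed by the children's labels must match the regular expression of the node's rule. The dictionary encoding of the DTD, the single-letter labels, the function name `satisfies`, and the choice of `r` as the root label are assumptions for illustration. Note that this is not a streaming check.

```python
import re

# Hypothetical encoding: each rule maps a label to a regular expression over
# single-letter child labels; the empty string stands for epsilon.
DTD = {"r": "a*", "a": "bc", "b": "c?", "c": ""}

def satisfies(tree, dtd, root="r"):
    """Check that `tree` = (label, [children]) is a derivation tree of `dtd`."""
    def ok(node):
        label, children = node
        if label not in dtd:
            return False
        # The word formed by the labels of the children, in order.
        child_word = "".join(child[0] for child in children)
        return (re.fullmatch(dtd[label], child_word) is not None
                and all(ok(child) for child in children))
    return tree[0] == root and ok(tree)

good = ("r", [("a", [("b", [("c", [])]), ("c", [])]),
              ("a", [("b", []), ("c", [])])])
bad = ("r", [("a", [("b", [])])])   # a must have children b c
print(satisfies(good, DTD), satisfies(bad, DTD))
```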

Page 8: Validating Streaming XML Documents Luc Segoufin & Victor Vianu

DTDs – cont’

Each DTD $d$ has a unique rule $a \to R_a$ for each symbol $a \in \Sigma$, where $R_a$ denotes the regular expression of the rule.

$L(d)$ is the language over $\Sigma \cup \bar{\Sigma}$ consisting of the string representations of all tree documents satisfying $d$.

Page 9: Validating Streaming XML Documents Luc Segoufin & Victor Vianu

Strong Validation of Streaming XML Documents

The problem: validating an XML document with respect to a given DTD.

Need to characterize the DTDs $d$ for which $L(d)$ can be recognized by an FSA.

Such DTDs are called strongly recognizable.

Page 10: Validating Streaming XML Documents Luc Segoufin & Victor Vianu

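Strong Validation – Example 1

DTD $d$: r → a, a → a?

$L(d) = \{\, r\, a^n\, \bar{a}^n\, \bar{r} \mid n \ge 1 \,\}$.

$L(d)$ is not regular, so $d$ cannot be strongly validated by an FSA. $d$ is not strongly recognizable.

[Figure: a tree document with root r above a single chain of a nodes.]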

Page 11: Validating Streaming XML Documents Luc Segoufin & Victor Vianu

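Strong Validation – Example 2

DTD $d$: r → a*, a → b|c

$L(d) = \{\, r\, (a\, (b\bar{b} \mid c\bar{c})\, \bar{a})^{*}\, \bar{r} \,\}$.

$L(d)$ is regular, so $d$ is strongly recognizable.

[Figure: a tree document with root r and a row of a children, each with a single child b or c.]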
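Because $L(d)$ is regular here, an ordinary regular expression engine can act as the strong validator. The sketch below assumes a textual encoding in which `<a>` is the opening tag $a$ and `</a>` is the closing tag $\bar{a}$; the name `strongly_valid` is illustrative.

```python
import re

# Textual stand-in for r (a (b b-bar | c c-bar) a-bar)* r-bar.
L_D = re.compile(r"<r>(?:<a>(?:<b></b>|<c></c>)</a>)*</r>")

def strongly_valid(document: str) -> bool:
    """Accept exactly the strings of L(d); malformed input is rejected too."""
    return L_D.fullmatch(document) is not None

print(strongly_valid("<r><a><b></b></a><a><c></c></a></r>"))
print(strongly_valid("<r><a><b></a></b></r>"))   # not well formed
```

No assumption about well-formedness is needed: a string with mismatched tags simply fails to match.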

Page 12: Validating Streaming XML Documents Luc Segoufin & Victor Vianu

More Definitions

Let $d$ be a DTD over $\Sigma$. The dependency graph of $d$, $G_d$, is the graph constructed as follows:
Its set of vertices is $\Sigma$.
For each rule $a \to R_a$ in $d$, there is an edge from $a$ to $b$, for each $b$ occurring in $R_a$.

Page 13: Validating Streaming XML Documents Luc Segoufin & Victor Vianu

More Definitions (cont’)

Two labels, $a$ and $b$, are mutually recursive if they belong to some cycle of $G_d$.
$a$ is recursive if it is mutually recursive with itself.
DTD $d$ is non-recursive iff $G_d$ is acyclic.
A DTD $d$ is fully recursive if all labels from which recursive labels are reachable in $G_d$ are mutually recursive.
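These definitions translate into a small reachability computation. The sketch below assumes single-letter labels, a dictionary encoding of the rules, and reads "belong to some cycle" as mutual reachability in $G_d$; the function names are illustrative.

```python
def dependency_graph(dtd):
    """Edges a -> b for each label b occurring in R_a (single-letter labels)."""
    return {a: {ch for ch in expr if ch.isalpha()} for a, expr in dtd.items()}

def reachable(graph, start):
    """Labels reachable from `start` by a path of length >= 1."""
    seen, todo = set(), list(graph.get(start, ()))
    while todo:
        node = todo.pop()
        if node not in seen:
            seen.add(node)
            todo.extend(graph.get(node, ()))
    return seen

def classify(dtd):
    graph = dependency_graph(dtd)
    reach = {a: reachable(graph, a) for a in graph}
    recursive = {a for a in graph if a in reach[a]}
    # Labels from which some recursive label is reachable.
    upstream = {a for a in graph if reach[a] & recursive}
    # Fully recursive: all such labels are pairwise mutually reachable.
    fully = all(b in reach[a] for a in upstream for b in upstream)
    return {"recursive_labels": recursive,
            "non_recursive": not recursive,
            "fully_recursive": fully}

print(classify({"r": "a", "a": "a?"}))
print(classify({"r": "a*", "a": "b|c", "b": "", "c": ""}))
```

Under this reading, a non-recursive DTD satisfies the "fully recursive" condition vacuously, since no label reaches a recursive one.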

Page 14: Validating Streaming XML Documents Luc Segoufin & Victor Vianu

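Dependency Graph – Examples

DTD $d$: r → a, a → a?
[Figure: $G_d$ with an edge from r to a and a loop on a.]
$G_d$ is not acyclic. $d$ is not fully recursive. $a$ is recursive.

DTD $d$: r → a*, a → b|c
[Figure: $G_d$ with an edge from r to a and edges from a to b and to c.]
$d$ is non-recursive.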

Page 15: Validating Streaming XML Documents Luc Segoufin & Victor Vianu

Characterization of Strongly Recognizable DTDs

Theorem 3.1 (partial): A DTD is strongly recognizable iff it is non-recursive.

Proof sketch:
If $d$ is a strongly recognizable DTD, there is an FSA recognizing exactly $L(d)$. Suppose towards a contradiction that $d$ is recursive, and show using the pumping lemma that the above FSA also accepts strings that are not well balanced.
If $d$ is non-recursive, an algorithm to build an FSA recognizing $L(d)$ is given.
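One way to see why a non-recursive DTD yields a regular $L(d)$ is to inline the rules into a single regular expression. The sketch below does this by substitution; it is an illustration under stated assumptions (single-letter labels, `<a>`/`</a>` as textual tags, the name `language_regex`), not the algorithm referred to in the proof sketch. It terminates only because the dependency graph is acyclic, and the resulting expression can grow exponentially.

```python
import re

DTD = {"r": "a*", "a": "b|c", "b": "", "c": ""}

def language_regex(dtd, label):
    """Regex for the string representations of trees rooted at `label`.

    Every label occurring in R_label is replaced by the regex of its own
    subtrees; operators (| * ? parentheses) are copied unchanged.
    """
    body = "".join(
        f"(?:{language_regex(dtd, ch)})" if ch.isalpha() else ch
        for ch in dtd[label]
    )
    return f"<{label}>(?:{body})</{label}>"

pattern = language_regex(DTD, "r")
print(re.fullmatch(pattern, "<r><a><b></b></a></r>") is not None)
```

For a recursive DTD the substitution would never bottom out, which matches the theorem: recursion is exactly what breaks regularity.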

Page 16: Validating Streaming XML Documents Luc Segoufin & Victor Vianu

Validating Well-Formed XML Documents

The problem: validating an XML document with respect to a given DTD $d$, assuming the XML document is well-formed.

A DTD that can be validated by an FSA under this assumption is called recognizable.

The requirement that $L(d)$ should be regular is now too strong.

The FSA should only work correctly on well-balanced strings representing trees.

Page 17: Validating Streaming XML Documents Luc Segoufin & Victor Vianu

Validation - Example 1

DTD $d$: r → a, a → a?

$L(d) = \{\, r\, a^n\, \bar{a}^n\, \bar{r} \mid n \ge 1 \,\}$

$d$ is not strongly recognizable. But it is recognizable: if the input is known to be well balanced, the FSA should just check that the string is of the form $r\, a^{*}\, \bar{a}^{*}\, \bar{r}$ (more precisely $r\, a\, a^{*}\, \bar{a}^{*}\, \bar{a}\, \bar{r}$).
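The difference between the two flavors can be seen with a regular expression for the shape $r\, a\, a^{*}\, \bar{a}^{*}\, \bar{a}\, \bar{r}$. The sketch assumes the textual tags `<a>` and `</a>`; the name `valid_if_well_balanced` is illustrative.

```python
import re

# "<a>" is the opening tag a and "</a>" the closing tag a-bar.
SHAPE = re.compile(r"<r><a>(?:<a>)*(?:</a>)*</a></r>")

def valid_if_well_balanced(document: str) -> bool:
    """Correct only under the promise that `document` is well balanced."""
    return SHAPE.fullmatch(document) is not None

print(valid_if_well_balanced("<r><a><a></a></a></r>"))    # in L(d)
print(valid_if_well_balanced("<r><a></a><a></a></r>"))    # balanced, not in L(d)
print(valid_if_well_balanced("<r><a><a></a></r>"))        # unbalanced input
```

The last call shows why this is not strong validation: an unbalanced string passes, because counting the opening and closing tags is left to the well-formedness assumption.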

Page 18: Validating Streaming XML Documents Luc Segoufin & Victor Vianu

Validation - Example 2

DTD $d$: a → (ab|ca|ε), b → ε, c → ε

$d$ is not recognizable. An FSA cannot store enough information to recall, when it reads $\bar{a}$, whether the corresponding $a$ node has a left sibling $c$ (in which case $b$ is not allowed to its right).

[Figure: tree documents built from a nodes with children a b and with children c a.]

Page 19: Validating Streaming XML Documents Luc Segoufin & Victor Vianu

Characterizing Recognizable DTDs

Which DTDs are recognizable?
Non-recursive DTDs.
What about recursive DTDs? Not a trivial question.
Are there any necessary conditions for a DTD to be recognizable?
Are there any subclasses of DTDs for which the necessary conditions are also sufficient?

Page 20: Validating Streaming XML Documents Luc Segoufin & Victor Vianu

Necessary Condition for a Recognizable DTD

Lemma 4.2: Let $d$ be a recognizable DTD. Then the following hold, where $u, v, w$, … are words over $\Sigma$, while $x, z$ (possibly subscripted) are individual symbols:
Let $k$ be a positive integer and $x_i, z_i$, $1 \le i \le k$, be mutually recursive symbols of $d$ (not necessarily distinct). If , and for , then must be in .

11 zRx 1zk Rx

iziiiii Rwxvxu 1 ki 11z

R kk xvxvx ...221

Page 21: Validating Streaming XML Documents Luc Segoufin & Victor Vianu

Fully Recursive DTDs

The necessary condition for a DTD to be recognizable, stated in Lemma 4.2, is also sufficient when the DTD is fully recursive.

Next, we’ll see how to construct an FSA $A_d$ for a DTD $d$, which accepts all words in $L(d)$ (and possibly more).

For fully recursive DTDs $d$ satisfying the conditions of Lemma 4.2, the well-balanced words accepted by $A_d$ are precisely the words in $L(d)$ ($A_d$ may also accept words that are not well balanced).

Page 22: Validating Streaming XML Documents Luc Segoufin & Victor Vianu

The Standard FSA

Let $d$ be a DTD over alphabet $\Sigma$.

An equivalence relation is defined on $\Sigma$: its equivalence classes are the strongly connected components of $G_d$.

Let $\le$ be a partial order on the classes, where $A \le B$ iff for some $a \in A$ and $b \in B$ there is an edge from $a$ to $b$ in $G_d$.

$\le$ may have several maximal classes, but only one minimum class.
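The classes and the edges that generate the order on them can be computed directly from $G_d$. The sketch below uses plain reachability rather than a linear-time algorithm, and the adjacency-set encoding and the name `classes_and_order` are assumptions for illustration.

```python
def classes_and_order(graph):
    """Return the strongly connected components and the edges between them."""
    def reach(start):
        # Labels reachable from `start`, including `start` itself.
        seen, todo = {start}, [start]
        while todo:
            for nxt in graph.get(todo.pop(), ()):
                if nxt not in seen:
                    seen.add(nxt)
                    todo.append(nxt)
        return seen

    nodes = set(graph) | {b for targets in graph.values() for b in targets}
    closure = {a: reach(a) for a in nodes}
    # Two labels share a class iff each one reaches the other.
    cls = {a: frozenset(b for b in nodes if b in closure[a] and a in closure[b])
           for a in nodes}
    # A pair (A, B) is recorded when some edge leads from A into a different B.
    order = {(cls[a], cls[b]) for a in nodes for b in graph.get(a, ())
             if cls[a] != cls[b]}
    return set(cls.values()), order

classes, order = classes_and_order({"r": {"a"}, "a": {"a"}})
print(classes, order)
```

For the dependency graph of r → aa, a → a? this gives the two singleton classes, with the class of r below the class of a.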

Page 23: Validating Streaming XML Documents Luc Segoufin & Victor Vianu

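Example

DTD $d$: r → aa, a → a?

The classes are $\{r\}$ and $\{a\}$, with $\{r\} \le \{a\}$.

[Figure: the dependency graph $G_d$ (an edge from r to a and a loop on a), with the automata $A_{\{r\}}$ and $A_{\{a\}}$ of the two classes.]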

Page 24: Validating Streaming XML Documents Luc Segoufin & Victor Vianu

Example – cont’

DTD $d$: r → aa, a → a?

[Figure: constructing the FSA $A_a$ for $R_a$, with states $q_0^a$, $f_1^a$ and $f_2^a$, and then the FSA $A_A$ for the string representation of class $A = \{a\}$.]

For each edge $(q, b, q')$ in $A_a$, add to $A_A$: $(q, b, q_0)$ and $(f, \bar{b}, q')$.

Page 25: Validating Streaming XML Documents Luc Segoufin & Victor Vianu

Example – cont’

DTD $d$: r → aa, a → a?

[Figure: the FSA $A_r$ for $R_r$, with states $q_0^r$ and $f^r$, and the FSA $A_R$ for class $R = \{r\}$, which contains copies of the automaton for class $\{a\}$.]

Page 26: Validating Streaming XML Documents Luc Segoufin & Victor Vianu

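Example – cont’

DTD $d$: r → aa, a → a?

[Figure: the final FSA $A_d$, obtained from the automaton of the previous slide by adding a start state $s$ and a final state $g$, with transitions on $r$ and $\bar{r}$.]

The above FSA recognizes all well-balanced words produced by the above DTD.

But it also accepts other well-balanced words (such as $r\, a\, a\, \bar{a}\, a\, \bar{a}\, \bar{a}\, \bar{r}$). There is no automaton recognizing this DTD.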
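The over-acceptance can be reproduced with a small simulation. The transition table below is hand-built from one reading of the construction given in the appendix slides, for the DTD r → aa, a → a?; the state names and the `/a` notation for $\bar{a}$ are assumptions, and the table is a sketch rather than a copy of the automaton drawn on the slide.

```python
# Hand-built table for A_d of r -> aa, a -> a?. "p"/"x" are the initial state
# and the second final state of the first copy of the class-{a} automaton;
# "p2"/"x2" play the same roles in the second copy.
DELTA = {
    ("s", "r"): {"q0"},
    ("q0", "a"): {"p"},          # first child of r: enter copy 1
    ("p", "a"): {"p"},           # a nested a inside copy 1
    ("p", "/a"): {"x", "q1"},    # close a nested a, or close the first child
    ("x", "/a"): {"x", "q1"},
    ("q1", "a"): {"p2"},         # second child of r: enter copy 2
    ("p2", "a"): {"p2"},
    ("p2", "/a"): {"x2", "f"},
    ("x2", "/a"): {"x2", "f"},
    ("f", "/r"): {"g"},
}

def accepts(tokens):
    """Standard subset simulation of the non-deterministic automaton."""
    states = {"s"}
    for tok in tokens:
        states = set().union(*(DELTA.get((q, tok), set()) for q in states))
    return "g" in states

print(accepts("r a /a a /a /r".split()))          # in L(d)
print(accepts("r a a /a a /a /a /r".split()))     # balanced, not in L(d)
```

The second word encodes a tree in which r has a single child a with two children. The automaton accepts it because, after a closing tag, it no longer knows at which nesting level it is.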

Page 27: Validating Streaming XML Documents Luc Segoufin & Victor Vianu

Recognizable Fully Recursive DTDs

Theorem 4.1: The following are equivalent for each fully recursive DTD $d$:

(i) $d$ is recognizable.
(ii) $d$ satisfies the conditions of Lemma 4.2.
(iii) The set of well-balanced strings accepted by the FSA $A_d$ is precisely $L(d)$.

Page 28: Validating Streaming XML Documents Luc Segoufin & Victor Vianu

Recognizable DTDs

Which DTDs are recognizable?
Non-recursive DTDs.
Fully recursive DTDs satisfying the conditions of Lemma 4.2.
And others…

But characterization in the general case remains an open question.

Partial progress: necessary conditions for recognizability.

Page 29: Validating Streaming XML Documents Luc Segoufin & Victor Vianu

Alternative Validation Approaches

2 alternative approaches for validating DTDs that are not recognizable:
Relax the constant memory requirement.
Refine the original DTD.

Page 30: Validating Streaming XML Documents Luc Segoufin & Victor Vianu

Validation with Bounded Stack

Relaxing the constant memory requirement.
Use a stack whose depth is bounded by the depth of the XML document.
Validation is done in a single deterministic pass.
Appealing approach in practice.
For each DTD, there exists a deterministic PDA that accepts precisely its language.

Example, the DTD: r → aa, a → a?
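A stack-based validator for the example DTD can be sketched as follows. Each rule is given as a small hand-written deterministic automaton over child labels, and the stack holds one (label, state) pair per open element, so its height never exceeds the depth of the document. The `RULES` encoding, the `/a` notation for closing tags, and the function name `validate` are assumptions for illustration.

```python
# label -> (initial state, accepting states, transitions) for r -> aa, a -> a?
RULES = {
    "r": (0, {2}, {(0, "a"): 1, (1, "a"): 2}),
    "a": (0, {0, 1}, {(0, "a"): 1}),
}

def validate(tokens, root="r", rules=RULES):
    """Single deterministic pass; also rejects input that is not well formed."""
    stack = []            # one (label, state) entry per currently open element
    seen_root = False
    for tok in tokens:
        if tok.startswith("/"):
            label = tok[1:]
            if not stack or stack[-1][0] != label:
                return False              # mismatched closing tag
            _, state = stack.pop()
            if state not in rules[label][1]:
                return False              # children do not match R_label
        else:
            if tok not in rules:
                return False              # unknown label
            if stack:
                parent, state = stack[-1]
                nxt = rules[parent][2].get((state, tok))
                if nxt is None:
                    return False          # child not allowed at this position
                stack[-1] = (parent, nxt)
            elif seen_root or tok != root:
                return False              # wrong root or a second root
            seen_root = True
            stack.append((tok, rules[tok][0]))
    return seen_root and not stack

print(validate("r a a /a /a a /a /r".split()))
print(validate("r a a /a a /a /a /r".split()))
```

The second input is the word that fools the FSA for this DTD; with the stack, the validator sees that the inner a receives a second child and rejects it.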

Page 31: Validating Streaming XML Documents Luc Segoufin & Victor Vianu

Refining the DTD

Refining a DTD means providing in the tags additional information that can be used for validation.

Example:

DTD: r → aa, a → a?

Refined DTD: r → a_1 a_2, a_1 → a_1?, a_2 → a_2?

The refined DTD can be validated by an FSA.

For every DTD, there exists an equivalent refined DTD of quadratic size that is recognizable.
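For the refined DTD of this example, a single regular expression is enough once the input is known to be well balanced, because the tag names themselves say which child of r is being read. The textual tags `<a1>`, `<a2>` and the name `valid_refined` are assumptions for illustration.

```python
import re

# Shape of r -> a1 a2, a1 -> a1?, a2 -> a2? on well-balanced input:
# a chain of a1 elements, then a chain of a2 elements, inside r.
REFINED = re.compile(r"<r>(?:<a1>)+(?:</a1>)+(?:<a2>)+(?:</a2>)+</r>")

def valid_refined(document: str) -> bool:
    """Correct only under the promise that `document` is well balanced."""
    return REFINED.fullmatch(document) is not None

print(valid_refined("<r><a1><a1></a1></a1><a2></a2></r>"))
print(valid_refined("<r><a1></a1></r>"))
```

On well-balanced input the shape check suffices, since balance forces the number of opening and closing tags in each chain to agree.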

Page 32: Validating Streaming XML Documents Luc Segoufin & Victor Vianu

Summary

First step towards the formal investigation of processing streaming XML.

Provided conditions under which validation can be done in a single pass and constant memory, using an FSA.

Considered alternative approaches, when validation using an FSA is not possible.

Page 33: Validating Streaming XML Documents Luc Segoufin & Victor Vianu

Appendix

The Standard FSA Construction

Page 34: Validating Streaming XML Documents Luc Segoufin & Victor Vianu

The Standard FSA

$A_d$ is inductively constructed starting from the maximal elements of the partial order.

Let $C$ be a maximal element of the partial order.
For each regular expression $R_c$ ($c \in C$), a non-deterministic FSA $A_c$ is built.
Disjoint states for different $c$’s.
The initial state of $A_c$ is $q_0^c$, while its final states are $f_1^c, f_2^c, \ldots$

Page 35: Validating Streaming XML Documents Luc Segoufin & Victor Vianu

The Standard FSA – cont’

Build $A_C$, where $C$ is a maximal element of the partial order:

Its states are the union of the states of the FSAs $A_c$ for $c \in C$.

Transitions: for each transition $(q, b, q')$ of $A_c$, add to $A_C$ the transitions:
$(q, b, q_0)$ for $q_0$ the initial state of $A_b$.
$(f, \bar{b}, q')$ for each final state $f$ of $A_b$.

($b$ must belong to $C$, since $C$ is a maximal element.)

Page 36: Validating Streaming XML Documents Luc Segoufin & Victor Vianu

The Standard FSA – cont’

Build $A_C$ for non-maximal elements $C$ of the partial order, when all FSAs $A_E$ of elements $E$ above $C$ in the order are already constructed:

Unlike the maximal elements case, $A_c$ has transitions $(q, e, q')$ where $e \notin C$ (i.e., $e \in E$). For such transitions, we add to $A_C$:
A new disjoint copy of $A_E$.
$(q, e, q_0)$ for $q_0$ the initial state of $A_e$.
$(f, \bar{e}, q')$ for each final state $f$ of $A_e$.

Page 37: Validating Streaming XML Documents Luc Segoufin & Victor Vianu

The Standard FSA – cont’

The final FSA $A_d$ is obtained by adding to the FSA $A_C$ of the minimum class (containing the root label $r$):
A new start state $s$ with transition $(s, r, q_0)$ for $q_0$ the start state of $A_r$.
A final state $g$ with transition $(f, \bar{r}, g)$ for each final state $f$ of $A_r$.

Page 38: Validating Streaming XML Documents Luc Segoufin & Victor Vianu

The Standard FSA - Lemma

Lemma 4.3: For each DTD $d$, let $A_d$ be the automaton described. We have:

(i) Every word in $L(d)$ is accepted by $A_d$.
(ii) $A_d$ can be constructed from $d$ in exponential time.

Complexity of $A_d$’s construction: $O(|d|^{|\le|})$.
$|d|$ is the maximum size of an FSA for a regular expression of $d$.
$|\le|$ is the depth of the partial order $\le$.