validating streaming xml documents luc segoufin & victor vianu

Post on 19-Jan-2016



Validating Streaming XML Documents

Luc Segoufin & Victor Vianu

Presented by Harel Paz

The Challenge

XML is becoming a standard for data exchange on the Web.

Need: on-line processing of large amounts of data in XML format, using limited memory.

Our focus: validating XML documents against given DTDs.

Validating Streaming XML Documents

Restrictions on the validation:
In a single pass.
Using a fixed amount of memory, depending on the DTD.

[Figure: an input stream ...<u><v>...</v><v><w>...</w></v>... is read by an FSA, which starts in its start state and answers Yes/No (accept or reject).]

The Problem in 2 Flavors

There are 2 flavors to the problem:

Strong validation: validation that includes checking well-formedness.

Validation: checking satisfaction of the DTD, under the assumption that the input is a well-formed XML document.

Tree Document

XML documents are abstracted by “tree documents”.

A tree document over a finite alphabet $\Sigma$ is a finite unranked tree with labels in $\Sigma$ and an order on the children of each node.

[Figure: an example tree document $t$ with root $r$ and nodes labeled $a$, $b$, $c$.]

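String Representation

XML documents are a string representation of trees using opening and closing tags for each element.

For each $a \in \Sigma$, $a$ represents the opening tag and $\bar a$ represents the closing tag for $a$.

Notation: $\bar\Sigma = \{\bar a \mid a \in \Sigma\}$.

[Figure: the example tree document and its string representation over $\Sigma \cup \bar\Sigma$.]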
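A minimal sketch of the string representation, assuming a hypothetical `Node` class and writing the closing tag $\bar a$ as the token `/a` (both are illustration choices, not notation from the slides). The tree used here is made up for the example and is not the tree from the figure.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A node of a tree document: a label plus an ordered list of children."""
    label: str
    children: List["Node"] = field(default_factory=list)


def serialize(node):
    """Return the string representation as a token list.

    The opening tag of label a is the token "a"; the closing tag is "/a".
    """
    tokens = [node.label]
    for child in node.children:
        tokens.extend(serialize(child))
    tokens.append("/" + node.label)
    return tokens


# Hypothetical tree: r has two a-children; the first a has children b and c.
tree = Node("r", [Node("a", [Node("b"), Node("c")]), Node("a")])
print(" ".join(serialize(tree)))
```

Every opening token is matched by exactly one closing token, so the output is always a well-balanced string.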

DTDs

A DTD consists of an extended context-free grammar over alphabet $\Sigma$.

DTD $d$:
$r \to a^*$
$a \to bc$
$b \to c?$
$c \to \epsilon$

A tree document over $\Sigma$ satisfies a DTD $d$ if it is a derivation tree of the grammar.

[Figure: a tree document $T$ that satisfies $d$.]
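The satisfaction check can be sketched as follows. This is an illustration under stated assumptions: labels are single characters, a tree is a hypothetical `(label, children)` tuple, and each content model is written as a Python regular expression (the empty string stands for $\epsilon$). The example tree is made up, not taken from the figure.

```python
import re

# The DTD from the slide: r -> a*, a -> bc, b -> c?, c -> epsilon.
DTD = {"r": "a*", "a": "bc", "b": "c?", "c": ""}


def satisfies(tree, dtd, root="r"):
    """True if the tree is rooted at `root` and every node obeys its rule."""
    return tree[0] == root and _node_ok(tree, dtd)


def _node_ok(tree, dtd):
    label, children = tree
    if label not in dtd:
        return False
    # The word formed by the children's labels must match the content model.
    word = "".join(child[0] for child in children)
    if re.fullmatch(dtd[label], word) is None:
        return False
    return all(_node_ok(child, dtd) for child in children)


good = ("r", [("a", [("b", [("c", [])]), ("c", [])]),
              ("a", [("b", []), ("c", [])])])
bad = ("r", [("a", [("b", [])])])  # an a-node must have children b c

print(satisfies(good, DTD), satisfies(bad, DTD))
```

This version needs the whole tree in memory, which is exactly what the streaming setting of the talk rules out.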

DTDs – cont’

Each DTD $d$ has a unique rule $a \to R_a$ for each symbol $a \in \Sigma$. $R_a$ denotes the regular expression.

$L(d)$ is the language over $\Sigma \cup \bar\Sigma$ consisting of the string representations of all tree documents satisfying $d$.

Strong Validation of Streaming XML Documents

The problem: validating an XML document with respect to a given DTD.

Need to characterize the DTDs $d$ for which $L(d)$ can be recognized by an FSA.

Such DTDs are called strongly recognizable.

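Strong Validation – Example 1

DTD $d$:
$r \to a$
$a \to a?$

$L(d) = \{ r\, a^n\, \bar a^n\, \bar r \mid n \ge 1 \}$

$L(d)$ is not regular, so $d$ cannot be strongly validated by an FSA.

$d$ is not strongly recognizable.

[Figure: a tree with root $r$ and a single chain of $a$ nodes below it.]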

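Strong Validation – Example 2

DTD $d$:
$r \to a^*$
$a \to b \mid c$

$L(d) = \{ r\, (a\, (b \bar b \mid c \bar c)\, \bar a)^*\, \bar r \}$

$L(d)$ is regular, so $d$ is strongly recognizable.

[Figure: a tree with root $r$, any number of $a$ children, each with a single child $b$ or $c$.]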
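Because $L(d)$ is regular for the DTD $r \to a^*$, $a \to b \mid c$, a small deterministic automaton suffices for strong validation. The sketch below is hand-built for this one DTD; the state numbers and the `/a` token convention for closing tags are assumptions of the example.

```python
# Transition table of a DFA for  r (a (b /b | c /c) /a)* /r.
DELTA = {
    (0, "r"): 1,                   # opening root tag
    (1, "a"): 2, (1, "/r"): 5,     # another a-child, or close the root
    (2, "b"): 3, (2, "c"): 4,      # the single child of a is b or c
    (3, "/b"): 6, (4, "/c"): 6,
    (6, "/a"): 1,
}
ACCEPT = {5}


def strongly_validate(tokens):
    """Single pass, constant memory: only the current state is kept."""
    state = 0
    for tok in tokens:
        state = DELTA.get((state, tok))
        if state is None:          # no transition: reject immediately
            return False
    return state in ACCEPT


print(strongly_validate("r a b /b /a a c /c /a /r".split()))
print(strongly_validate("r a b /c /a /r".split()))
```

Since the automaton recognizes exactly $L(d)$, it also rejects inputs that are not well-formed, which is what strong validation requires.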

More Definitions

Let $d$ be a DTD over $\Sigma$.

The dependency graph of $d$, $G_d$, is the graph constructed as follows:
Its set of vertices is $\Sigma$.
For each rule $a \to R_a$ in $d$, there is an edge from $a$ to $b$, for each $b$ occurring in $R_a$.

More Definitions (cont’)

Two labels, $a$ and $b$, are mutually recursive if they belong to some cycle of $G_d$.

$a$ is recursive if it is mutually recursive with itself.

DTD $d$ is non-recursive iff $G_d$ is acyclic.

A DTD $d$ is fully recursive if all labels from which recursive labels are reachable in $G_d$ are mutually recursive.

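Dependency Graph – Examples

DTD $d$: $r \to a$, $a \to a?$
$G_d$ has an edge from $r$ to $a$ and an edge from $a$ to $a$.
$G_d$ is not acyclic. $d$ is not fully recursive. $a$ is recursive.

DTD $d$: $r \to a^*$, $a \to b \mid c$
$G_d$ has edges from $r$ to $a$, from $a$ to $b$ and from $a$ to $c$.
$d$ is non-recursive.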
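The definitions above can be checked mechanically on the two example DTDs. The sketch assumes single-character lowercase labels, content models written as Python regular expressions, and an explicit empty rule for $b$ and $c$ (the slide omits them); the function names are made up for the illustration.

```python
import re


def dependency_graph(dtd):
    """Edge a -> b for every label b occurring in the content model R_a."""
    return {a: set(re.findall(r"[a-z]", regex)) for a, regex in dtd.items()}


def reachable(graph, start):
    """Labels reachable from `start` by a path of one or more edges."""
    seen, todo = set(), list(graph[start])
    while todo:
        label = todo.pop()
        if label not in seen:
            seen.add(label)
            todo.extend(graph.get(label, ()))
    return seen


def analyze(dtd):
    graph = dependency_graph(dtd)
    reach = {a: reachable(graph, a) for a in graph}
    # a is recursive iff it lies on a cycle, i.e. it can reach itself.
    recursive = {a for a in graph if a in reach[a]}
    # Labels from which some recursive label is reachable.
    upstream = {a for a in graph if reach[a] & recursive}
    # Fully recursive: all those labels are pairwise mutually recursive.
    fully = all(b in reach[a] for a in upstream for b in upstream)
    return {"recursive": recursive,
            "non_recursive": not recursive,
            "fully_recursive": fully}


D1 = {"r": "a", "a": "a?"}
D2 = {"r": "a*", "a": "b|c", "b": "", "c": ""}
print(analyze(D1))
print(analyze(D2)["non_recursive"])
```

For the first DTD the label $r$ reaches the recursive label $a$ but is not on any cycle, which is why the DTD is not fully recursive.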

Characterization of Strongly Recognizable DTDs

Theorem 3.1 (partial): A DTD is strongly recognizable iff it is non-recursive.

Proof sketch:
If $d$ is a strongly recognizable DTD, there is an FSA recognizing exactly $L(d)$. Suppose towards a contradiction that $d$ is recursive, and show using the pumping lemma that this FSA also accepts strings that are not well-balanced.
If $d$ is non-recursive, an algorithm to build an FSA recognizing $L(d)$ is given.

Validating Well-Formed XML Documents

The problem: validating an XML document with respect to a given DTD $d$, assuming the XML document is well-formed. Validation using an FSA.

Such DTDs are called recognizable.

The requirement that $L(d)$ should be regular is now too strong.

The FSA should only work correctly on well-balanced strings representing trees.

Validation - Example 1

DTD $d$:
$r \to a$
$a \to a?$

$L(d) = \{ r\, a^n\, \bar a^n\, \bar r \mid n \ge 1 \}$

$d$ is not strongly recognizable. But it is recognizable:

If the input is known to be well-balanced, the FSA should just check that the string is of the form $r\, a^*\, \bar a^*\, \bar r$ (more precisely $r\, a\, a^*\, \bar a^*\, \bar a\, \bar r$).
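A sketch of this check, using a Python regular expression in place of an explicit FSA and the token `/a` for $\bar a$ (both assumptions of the example). The pattern encodes the form $r\, a\, a^*\, \bar a^*\, \bar a\, \bar r$.

```python
import re

# r, then one or more opening a tags, then one or more closing a tags, then /r.
PATTERN = re.compile(r"r( a)+( /a)+ /r")


def validate_well_formed(tokens):
    """Correct only under the promise that `tokens` is well-balanced."""
    return PATTERN.fullmatch(" ".join(tokens)) is not None


# Well-balanced and in L(d): accepted.
print(validate_well_formed("r a a /a /a /r".split()))
# Well-balanced but r has two children: rejected.
print(validate_well_formed("r a /a a /a /r".split()))
# Not well-balanced: accepted anyway, because the promise is violated.
print(validate_well_formed("r a a /a /r".split()))
```

The third call shows why this is validation and not strong validation: the automaton does not count tags, so it relies on the input already being well-formed.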

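Validation - Example 2

DTD $d$:
$a \to (ab \mid ca \mid \epsilon)$
$b \to \epsilon$
$c \to \epsilon$

$d$ is not recognizable.

An FSA cannot store enough information to recall, when it reads $\bar a$, whether the corresponding node has a left sibling $c$ (in which case $b$ is not allowed to its right).

[Figure: example trees for $d$, with $a$ nodes that have children $a\,b$ or $c\,a$.]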

Characterizing Recognizable DTDs

Which DTDs are recognizable?
Non-recursive DTDs.
What about recursive DTDs? Not a trivial question.

Are there any necessary conditions for a DTD to be recognizable?

Are there any subclasses of DTDs for which the necessary conditions are also sufficient?

Necessary Condition for a Recognizable DTD

Lemma 4.2: Let $d$ be a recognizable DTD. Then the following hold, where $u, v, w$ (possibly subscripted) are words over $\Sigma$ while $x, z$ (possibly subscripted) are individual symbols:

Let $k$ be a positive integer and $x_i, z_i$, $1 \le i \le k$, be mutually recursive symbols of $d$ (not necessarily distinct). If $x_1 \in R_{z_1}$, $x_k \in R_{z_1}$, and $u_i x_i v_{i+1} x_{i+1} w_i \in R_{z_i}$ for $1 \le i < k$, then $x_1 v_2 x_2 \ldots v_k x_k$ must be in $R_{z_1}$.

Fully Recursive DTDs

The necessary condition for recognizability stated in Lemma 4.2 is also sufficient when the DTD is fully recursive.

Next, we’ll see how to construct an FSA $A_d$ for a DTD $d$, which accepts all words in $L(d)$ (and possibly more).

For fully recursive DTDs satisfying the conditions of Lemma 4.2, the well-balanced words accepted by $A_d$ are precisely the words in $L(d)$ ($A_d$ may also accept words that are not well-balanced).

The Standard FSA

Let $d$ be a DTD over alphabet $\Sigma$.

Define an equivalence relation $\equiv$ on $\Sigma$ whose equivalence classes are the strongly connected components of $G_d$.

Let $\le$ be a partial order on the classes of $\equiv$, where $A \le B$ iff for some $a \in A$ and $b \in B$ there is an edge from $a$ to $b$ in $G_d$.

$\le$ may have several maximal classes, but only one minimum class.

Example

DTD $d$:
$r \to aa$
$a \to a?$

The classes of $\equiv$ are $\{r\}$ and $\{a\}$, with $\{r\} \le \{a\}$.

[Figure: the dependency graph $G_d$ and the automata $A_{\{r\}}$ and $A_{\{a\}}$.]

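Example – cont’

DTD $d$:
$r \to aa$
$a \to a?$

Constructing the FSA $A_a$ for $R_a$:
[Figure: $A_a$, with initial state $q_0^a$ (which is also the final state $f_1^a$) and final state $f_2^a$.]

Constructing the FSA $A_{\{a\}}$ for the string representation of class $\{a\}$:
For each edge $(q, b, q')$ in $A_a$, add to $A_{\{a\}}$ the transitions $(q, b, q_0)$ and $(f, \bar b, q')$, where $q_0$ is the initial state of $A_b$ and $f$ ranges over the final states of $A_b$.
[Figure: the resulting FSA $A_{\{a\}}$.]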

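Example – cont’

DTD $d$:
$r \to aa$
$a \to a?$

[Figure: the FSA $A_r$ for $R_r$, with initial state $q_0^r$ and final state $f^r$, and the FSA $A_{\{r\}}$, in which each $a$-transition of $A_r$ is replaced by a copy of $A_{\{a\}}$.]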

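Example – cont’

DTD $d$:
$r \to aa$
$a \to a?$

[Figure: the FSA $A_d$, with start state $s$, final state $g$, and the two copies of $A_{\{a\}}$ between the $r$ and $\bar r$ transitions.]

The above FSA recognizes all well-balanced words produced by the above DTD.

But it also accepts other well-balanced words (such as $r\, a\, a\, \bar a\, a\, \bar a\, \bar a\, \bar r$).

There is no automaton recognizing this DTD.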

Recognizable Fully Recursive DTDs

Theorem 4.1: The following are equivalent for each fully recursive DTD $d$:
(i) $d$ is recognizable.
(ii) $d$ satisfies the conditions of Lemma 4.2.
(iii) The set of well-balanced strings accepted by the FSA $A_d$ is precisely $L(d)$.

Recognizable DTDs

Which DTDs are recognizable?
Non-recursive DTDs.
Fully recursive DTDs satisfying the conditions of Lemma 4.2.
And others…

But a characterization in the general case remains an open question.

Partial progress: necessary conditions for recognizability.

Alternative Validation Approaches

2 alternative approaches for validating DTDs that are not recognizable:
Relaxing the constant memory requirement.
Refining the original DTD.

Validation with Bounded Stack

Relaxing the constant memory requirement: use a stack whose depth is bounded by the depth of the XML document.

Validation is done in a single deterministic pass.

An appealing approach in practice.

For each DTD, there exists a deterministic PDA that accepts precisely its language.

Example: the DTD $r \to aa$, $a \to a?$
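A minimal sketch of stack-based validation for the example DTD $r \to aa$, $a \to a?$. Each stack entry holds an open element together with the current state of a small DFA for its content model; the DFA tables, the `/a` token convention and the function name are assumptions of the illustration.

```python
# One hand-written DFA per content model: R_r = aa and R_a = a?.
CONTENT_DFA = {
    "r": {"start": 0, "accept": {2}, "delta": {(0, "a"): 1, (1, "a"): 2}},
    "a": {"start": 0, "accept": {0, 1}, "delta": {(0, "a"): 1}},
}


def validate_with_stack(tokens, root="r"):
    """Single deterministic pass; stack depth never exceeds document depth."""
    stack = []            # entries: (label, state of that label's DFA)
    seen_root = False
    for tok in tokens:
        if tok.startswith("/"):
            # Closing tag: must match the open element, whose children
            # must form a word accepted by its content model.
            if not stack:
                return False
            label, state = stack.pop()
            if tok[1:] != label or state not in CONTENT_DFA[label]["accept"]:
                return False
        else:
            if tok not in CONTENT_DFA:
                return False
            if stack:
                # Advance the parent's DFA on the new child label.
                parent, state = stack.pop()
                nxt = CONTENT_DFA[parent]["delta"].get((state, tok))
                if nxt is None:
                    return False
                stack.append((parent, nxt))
            elif seen_root or tok != root:
                return False
            seen_root = True
            stack.append((tok, CONTENT_DFA[tok]["start"]))
    return seen_root and not stack


print(validate_with_stack("r a /a a a /a /a /r".split()))
print(validate_with_stack("r a a /a a /a /a /r".split()))
```

Because closing tags are compared with the top of the stack, this validator also checks well-formedness, so it rejects the well-balanced word that the constant-memory FSA for this DTD wrongly accepts.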

Refining the DTD

Refining a DTD means providing in the tags additional information that can be used for validation.

Example:
DTD: $r \to aa$, $a \to a?$
Refined DTD: $r \to a_1 a_2$, $a_1 \to a_1?$, $a_2 \to a_2?$

The refined DTD can be validated by an FSA.

For every DTD, there exists an equivalent refined DTD of quadratic size which is recognizable.
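One way to see why the refined DTD $r \to a_1 a_2$, $a_1 \to a_1?$, $a_2 \to a_2?$ is recognizable: on well-balanced input it is enough to check the shape $r\, a_1^+\, \bar a_1^+\, a_2^+\, \bar a_2^+\, \bar r$. The sketch below is an illustration derived from the refined rules, not an automaton given in the slides; the tokens `a1`, `/a1`, `a2`, `/a2` are an assumed encoding of the refined tags.

```python
import re

# Shape check for the refined DTD; counting is left to well-formedness.
REFINED = re.compile(r"r( a1)+( /a1)+( a2)+( /a2)+ /r")


def validate_refined(tokens):
    """Correct under the promise that `tokens` is well-balanced."""
    return REFINED.fullmatch(" ".join(tokens)) is not None


# Two chains below r, one tagged a1 and one tagged a2: accepted.
print(validate_refined("r a1 a1 /a1 /a1 a2 /a2 /r".split()))
# The refined analogue of the word that fooled the FSA for the original DTD:
# an a2 element nested inside an a1 element is now visible in the tags.
print(validate_refined("r a1 a1 /a1 a2 /a2 /a1 /r".split()))
```

The extra index in the tag tells the automaton which child of $r$ it is currently inside, which is the information the FSA for the unrefined DTD could not keep.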

Summary

First step towards the formal investigation of processing streaming XML.

Provided conditions under which validation can be done in a single pass and constant memory, using an FSA.

Considered alternative approaches, when validation using an FSA is not possible.

Appendix

The Standard FSA Construction

The Standard FSA $A_d$ is inductively constructed starting from the maximal elements of $\le$.

Let $C$ be a maximal element of $\le$.

For each regular expression $R_c$ ($c \in C$), a non-deterministic FSA $A_c$ is built.
Disjoint states for different $c$’s.
The initial state of $A_c$ is $q_0^c$, while its final states are $f_1^c, f_2^c, \ldots$

The Standard FSA – cont’

Build $A_C$ ($C$ is a maximal element of $\le$):

Its states are the union of the states of the FSAs $A_c$ for $c \in C$.

Transitions: for each transition $(q, b, q')$ of $A_c$ ($b$ must belong to $C$), add to $A_C$ the transitions:
$(q, b, q_0)$ for $q_0$ the initial state of $A_b$.
$(f, \bar b, q')$ for each final state $f$ of $A_b$.

The Standard FSA – cont’

Build $A_C$ for non-maximal elements $C$ of $\le$, when all FSAs $A_E$ of elements $E$ such that $C < E$ are already constructed:

Unlike the maximal elements case, $A_c$ has transitions $(q, e, q')$, where $e \notin C$ (i.e., $e \in E$).

For such transitions, we add to $A_C$:
A new disjoint copy of $A_E$.
$(q, e, q_0)$ for $q_0$ the initial state of $A_e$.
$(f, \bar e, q')$ for each final state $f$ of $A_e$.

The Standard FSA – cont’

The final FSA $A_d$ is obtained by adding to the FSA $A_C$ of the minimum class $C$ (containing the root label $r$):

A new start state $s$ with transition $(s, r, q_0)$ for $q_0$ the start state of $A_r$.

A final state $g$ with transition $(f, \bar r, g)$ for each final state $f$ of $A_r$.

The Standard FSA - Lemma

Lemma 4.3: For each DTD $d$, let $A_d$ be the automaton described. We have:
(i) Every word in $L(d)$ is accepted by $A_d$.
(ii) $A_d$ can be constructed from $d$ in exponential time.

Complexity of $A_d$’s construction: $O(|d|^{|\le|})$.
$|d|$ is the maximum size of an FSA for a regular expression of $d$.
$|\le|$ is the depth of the partial order $\le$.
