validating streaming xml documents luc segoufin & victor vianu presented by harel paz
Post on 20-Dec-2015
Validating Streaming XML Documents
Luc Segoufin & Victor Vianu
Presented by Harel Paz
The Challenge
XML is becoming a standard for data exchange on the Web.
Need: on-line processing of large amounts of data in XML format, using limited memory.
Our focus: validating XML documents against given DTDs.
Validating Streaming XML Documents
Restrictions over the validation:
In a single pass.
Using a fixed amount of memory, depending on the DTD.
[Figure: the input stream ...<u><v>...</v><v><w>...</w></v>... is read by an FSA, which answers Yes/No.]
The Problem in 2 Flavors
There are 2 flavors to the problem:
Strong validation: validation that includes checking well-formedness.
Validation: checking satisfaction of the DTD, under the assumption that the input is a well-formed XML document.
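Tree Document
XML documents are abstracted by "tree documents".
A tree document over a finite alphabet $\Sigma$ is a finite unranked tree with labels in $\Sigma$ and an order on the children of each node.
[Figure: a tree document $t$ with root $r$, two children labeled $a$, each with children $b$ and $c$, and one $b$ node with a single child $c$.]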
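String Representation
XML documents are a string representation of trees using opening and closing tags for each element.
For each $a \in \Sigma$, $a$ represents the opening tag and $\bar{a}$ represents the closing tag for $a$.
Notation: $\bar{\Sigma} = \{\bar{a} \mid a \in \Sigma\}$.
[Figure: the tree document from the previous slide next to its string representation, a string of 16 tags over $\Sigma \cup \bar{\Sigma}$ that opens with $r$ and closes with $\bar{r}$.]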
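The string representation can be sketched as a depth-first walk that emits an opening tag on the way down and a closing tag on the way up. In this minimal sketch the `Tree` type, the function `to_tags`, and the `"/a"` spelling of the closing tag $\bar{a}$ are illustrative assumptions. The example tree is also an assumption: the figure does not survive well enough to tell which `b` node carries the child `c`.

```python
from typing import List, Tuple

# A tree document node: (label, ordered list of children).
Tree = Tuple[str, List["Tree"]]


def to_tags(node: Tree) -> List[str]:
    """Return the string representation of a tree as a list of tags.

    The opening tag for label a is written "a"; the closing tag
    (a-bar on the slides) is written "/a" in this sketch.
    """
    label, children = node
    tags = [label]                      # opening tag
    for child in children:              # children in document order
        tags.extend(to_tags(child))
    tags.append("/" + label)            # closing tag
    return tags


# Assumed example: r with two a-children, each with children b and c;
# here the FIRST b is the one that has a single child c.
t: Tree = ("r", [
    ("a", [("b", [("c", [])]), ("c", [])]),
    ("a", [("b", []), ("c", [])]),
])

print(" ".join(to_tags(t)))
```

Every node contributes exactly two tags, so a tree with 8 nodes yields 16 tags.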
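DTDs
A DTD consists of an extended context-free grammar over alphabet $\Sigma$.
DTD $d$: $r \to a^*$, $a \to bc$, $b \to c?$, $c \to \epsilon$.
A tree document over $\Sigma$ satisfies a DTD $d$ if it is a derivation tree of the grammar.
[Figure: the tree document $T$ (root $r$, two $a$ children, each with children $b$ and $c$, one $b$ with a child $c$) satisfies $d$.]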
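Satisfaction of a DTD can be checked node by node: the word formed by a node's child labels must match the regular expression of that node's rule. The sketch below assumes one-character labels and writes each content model as a Python regex. The names `DTD`, `Tree`, and `satisfies` are illustrative, and the trees are assumed examples, not taken from the slides.

```python
import re
from typing import Dict, List, Tuple

Tree = Tuple[str, List["Tree"]]

# The DTD r -> a*, a -> bc, b -> c?, c -> epsilon, with each right-hand
# side written as a Python regex over one-character labels (an encoding
# chosen for this sketch; "" stands for epsilon).
DTD: Dict[str, str] = {"r": "a*", "a": "bc", "b": "c?", "c": ""}


def satisfies(node: Tree, dtd: Dict[str, str], root: str = "r") -> bool:
    """True if the tree is a derivation tree of the grammar with start symbol root."""
    label, children = node
    if label != root or label not in dtd:
        return False
    # The sequence of child labels must be a word of the rule's regular expression.
    child_word = "".join(child[0] for child in children)
    if re.fullmatch(dtd[label], child_word) is None:
        return False
    # Each child must in turn be a derivation tree for its own label.
    return all(satisfies(child, dtd, root=child[0]) for child in children)


good: Tree = ("r", [
    ("a", [("b", [("c", [])]), ("c", [])]),
    ("a", [("b", []), ("c", [])]),
])
bad: Tree = ("r", [("a", [("b", [])])])   # a is missing its c child

print(satisfies(good, DTD), satisfies(bad, DTD))
```

This check needs the whole tree in memory, which is exactly what the streaming setting of the talk tries to avoid.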
DTDs (cont.)
Each DTD $d$ has a unique rule $a \to R_a$ for each symbol $a \in \Sigma$.
$R_a$ denotes the regular expression on the right-hand side of the rule for $a$.
$L(d)$ is the language over $\Sigma \cup \bar{\Sigma}$ consisting of the string representations of all tree documents satisfying $d$.
Strong Validation of Streaming XML Documents
The problem: validating an XML document with respect to a given DTD.
Need to characterize the DTDs $d$ for which $L(d)$ can be recognized by an FSA.
Such DTDs are called strongly recognizable.
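Strong Validation – Example 1
DTD $d$: $r \to a$, $a \to a?$.
$L(d) = \{ r\, a^n\, \bar{a}^n\, \bar{r} \mid n \ge 1 \}$.
$L(d)$ is not regular, so $d$ cannot be strongly validated by an FSA.
$d$ is not strongly recognizable.
[Figure: a tree with root $r$ above a single chain of $a$ nodes.]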
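Strong Validation – Example 2
DTD $d$: $r \to a^*$, $a \to b \mid c$.
$L(d) = \{ r\, (a\, (b\bar{b} \mid c\bar{c})\, \bar{a})^*\, \bar{r} \}$.
$L(d)$ is regular, so $d$ is strongly recognizable.
[Figure: a tree with root $r$ and any number of $a$ children, each with a single child $b$ or $c$.]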
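Because $L(d)$ is regular in this example, strong validation reduces to one regular-expression match. In the sketch below the encoding is an assumption: an opening tag is a lowercase letter and the matching closing tag is the same letter in uppercase. The name `strongly_validate` is illustrative.

```python
import re

# L(d) for r -> a*, a -> b|c, with closing tags written in uppercase:
# r (a (bB | cC) A)* R
L_D = re.compile(r"r(a(bB|cC)A)*R")


def strongly_validate(doc: str) -> bool:
    """Accept exactly the strings of L(d); no well-formedness assumption is needed."""
    return L_D.fullmatch(doc) is not None


print(strongly_validate("rabBAacCAR"))   # two a-children, one with b and one with c
print(strongly_validate("rabBR"))        # unclosed a
```

Since the regular expression describes all of $L(d)$, ill-formed inputs such as an unclosed `a` are rejected as well.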
More Definitions
Let $d$ be a DTD over $\Sigma$.
The dependency graph of $d$, $G_d$, is the graph constructed as follows:
Its set of vertices is $\Sigma$.
For each rule $a \to R_a$ in $d$, there is an edge from $a$ to $b$, for each $b$ occurring in $R_a$.
More Definitions (cont.)
Two labels, $a$ and $b$, are mutually recursive if they belong to some cycle of $G_d$.
$a$ is recursive if it is mutually recursive with itself.
DTD $d$ is non-recursive iff $G_d$ is acyclic.
A DTD $d$ is fully recursive if all labels from which recursive labels are reachable in $G_d$ are mutually recursive.
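Dependency Graph – Examples
DTD $d$: $r \to a$, $a \to a?$.
$G_d$ has an edge from $r$ to $a$ and a self-loop on $a$.
$G_d$ is not acyclic. $a$ is recursive.
$d$ is not fully recursive: the recursive label $a$ is reachable from $r$, but $r$ is not recursive.
DTD $d$: $r \to a^*$, $a \to b \mid c$.
$G_d$ has edges from $r$ to $a$, from $a$ to $b$, and from $a$ to $c$.
$d$ is non-recursive.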
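The definitions above can be checked mechanically. The sketch below assumes one-character lowercase labels and content models written as Python regex strings; all function names are illustrative. It reads "belong to some cycle" as "each label is reachable from the other", which is the strongly-connected-component reading.

```python
import re
from typing import Dict, Set

Graph = Dict[str, Set[str]]


def dependency_graph(dtd: Dict[str, str]) -> Graph:
    """Edge a -> b whenever label b occurs in the content model R_a."""
    return {a: set(re.findall(r"[a-z]", body)) for a, body in dtd.items()}


def reachable(graph: Graph, start: str) -> Set[str]:
    """Labels reachable from start by a path of length >= 1."""
    seen: Set[str] = set()
    stack = list(graph.get(start, ()))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(graph.get(node, ()))
    return seen


def recursive_labels(graph: Graph) -> Set[str]:
    """A label is recursive if it lies on a cycle, i.e. it reaches itself."""
    return {a for a in graph if a in reachable(graph, a)}


def is_non_recursive(graph: Graph) -> bool:
    return not recursive_labels(graph)


def is_fully_recursive(graph: Graph) -> bool:
    """All labels that reach a recursive label must be pairwise mutually recursive."""
    rec = recursive_labels(graph)
    sources = {a for a in graph if reachable(graph, a) & rec}
    return all(b in reachable(graph, a) for a in sources for b in sources)


g1 = dependency_graph({"r": "a", "a": "a?"})
g2 = dependency_graph({"r": "a*", "a": "b|c", "b": "", "c": ""})
print(recursive_labels(g1), is_fully_recursive(g1), is_non_recursive(g2))
```

Note that a non-recursive DTD passes `is_fully_recursive` vacuously under this reading, since no label reaches a recursive one.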
Characterization of Strongly Recognizable DTDs
Theorem 3.1 (partial): A DTD is strongly recognizable iff it is non-recursive.
Proof sketch:
If $d$ is a strongly recognizable DTD, there is an FSA recognizing exactly $L(d)$. Suppose towards a contradiction that $d$ is recursive, and show using the pumping lemma that the above FSA also accepts strings that are not well-balanced.
If $d$ is non-recursive, an algorithm to build an FSA recognizing $L(d)$ is given.
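One way to see why a non-recursive DTD gives a regular $L(d)$ is to inline the rules: every label in a content model is replaced by its opening tag, the expansion of its own rule, and its closing tag. The sketch below is this substitution idea, not the FSA-building algorithm referred to in the proof sketch. It assumes one-character lowercase labels, uppercase letters for closing tags, and content models written as Python regex strings; `language_regex` is an illustrative name.

```python
import re
from typing import Dict


def language_regex(dtd: Dict[str, str], label: str) -> str:
    """Regex for the string representations of all trees rooted at label.

    Terminates only for non-recursive DTDs: on a recursive DTD the
    substitution below would recurse forever.
    """
    # Replace each label b occurring in R_label by "b <expansion of R_b> B".
    body = re.sub(r"[a-z]", lambda m: language_regex(dtd, m.group(0)), dtd[label])
    return f"(?:{label}(?:{body}){label.upper()})"


d2 = {"r": "a*", "a": "b|c", "b": "", "c": ""}
d3 = {"r": "a*", "a": "bc", "b": "c?", "c": ""}

print(language_regex(d2, "r"))
print(re.fullmatch(language_regex(d3, "r"), "rabcCBcCAabBcCAR") is not None)
```

The resulting expression can grow exponentially with the depth of the dependency graph, so this is useful for illustration more than for practice.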
Validating Well-Formed XML Documents
The problem: validating an XML document with respect to a given DTD $d$, assuming the XML document is well-formed.
DTDs for which this validation can be done by an FSA are called recognizable.
The requirement that $L(d)$ should be regular is now too strong.
The FSA should only work correctly on well-balanced strings representing trees.
Validation - Example 1
DTD $d$: $r \to a$, $a \to a?$.
$L(d) = \{ r\, a^n\, \bar{a}^n\, \bar{r} \mid n \ge 1 \}$.
$d$ is not strongly recognizable. But it is recognizable:
If the input is known to be well balanced, the FSA should just check that the string is of the form $r\, a^*\, \bar{a}^*\, \bar{r}$ (more precisely $r\, a\, a^*\, \bar{a}^*\, \bar{a}\, \bar{r}$).
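The difference between the two flavors shows up directly in code. The sketch below checks only the shape $r\, a\, a^*\, \bar{a}^*\, \bar{a}\, \bar{r}$, writing closing tags as uppercase letters (an encoding assumed for this sketch; `validate_well_formed` is an illustrative name). It is correct on well-balanced inputs and makes no promise on others.

```python
import re

# r a a* A* A R, with A standing for the closing tag of a and R for that of r.
SHAPE = re.compile(r"raa*A*AR")


def validate_well_formed(doc: str) -> bool:
    """Decide membership in L(d), assuming doc is already known to be well balanced."""
    return SHAPE.fullmatch(doc) is not None


print(validate_well_formed("raaAAR"))   # well balanced, chain of two a nodes
print(validate_well_formed("raAaAR"))   # well balanced, but r has two children
print(validate_well_formed("raaAR"))    # NOT well balanced, yet accepted
```

The last call shows why this FSA does not strongly validate: it accepts an ill-formed string, which is allowed only because the input is promised to be well-formed.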
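Validation - Example 2
DTD $d$: $a \to (ab \mid ca \mid \epsilon)$, $b \to \epsilon$, $c \to \epsilon$.
$d$ is not recognizable.
An FSA cannot store enough information to recall, when it reads $\bar{a}$, whether the corresponding node has a left sibling $c$ (in which case $b$ is not allowed to its right).
[Figure: two trees rooted at $a$, one nesting the pattern $a\,b$ and one nesting the pattern $c\,a$.]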
Characterizing Recognizable DTDs
Which DTDs are recognizable?
Non-recursive DTDs.
What about recursive DTDs? Not a trivial question.
Are there any necessary conditions for a DTD to be recognizable?
Are there any subclasses of DTDs for which the necessary conditions are also sufficient?
Necessary Condition for a Recognizable DTD
Lemma 4.2: Let $d$ be a recognizable DTD. Then the following hold, where $u, v, w, \ldots$ are words over $\Sigma$ while $x, z$ (possibly subscripted) are individual symbols:
Let $k$ be a positive integer and $x_i$, $z_i$, $1 \le i \le k$, be mutually recursive symbols of $d$ (not necessarily distinct). If $x_1 \in R_{z_1}$, $x_k \in R_{z_1}$, and $u_i\, x_{i-1}\, v_i\, x_i\, w_i \in R_{z_i}$ for $1 < i \le k$, then $x_1\, v_2\, x_2 \cdots v_k\, x_k$ must be in $R_{z_1}$.
Fully Recursive DTDs
The necessary condition stated in Lemma 4.2 for a DTD to be recognizable is also sufficient when the DTD is fully recursive.
Next, we'll see how to construct an FSA $A_d$ for a DTD $d$, which accepts all words in $L(d)$ (and possibly more).
For fully recursive DTDs satisfying the conditions of Lemma 4.2, the well-balanced words accepted by $A_d$ are precisely the words in $L(d)$ ($A_d$ may also accept words that are not well-balanced).
The Standard FSA
Let $d$ be a DTD over alphabet $\Sigma$.
Consider the equivalence relation on $\Sigma$ whose equivalence classes are the strongly connected components of $G_d$.
Let $\le$ be a partial order on the classes, where $A \le B$ iff for some $a \in A$ and $b \in B$ there is an edge from $a$ to $b$ in $G_d$.
$\le$ may have several maximal classes, but only one minimum class.
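Example
DTD $d$: $r \to aa$, $a \to a?$.
The classes are $\{r\}$ and $\{a\}$, with $\{r\} \le \{a\}$.
[Figure: the dependency graph $G_d$ and the automata $A_{\{r\}}$ and $A_{\{a\}}$ for the two classes.]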
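Example (cont.)
DTD $d$: $r \to aa$, $a \to a?$.
[Figure: the FSA $A_a$ for $R_a$, with states labeled $q_0^a, f_1^a$ and $f_2^a$, and the FSA $A_{\{a\}}$ for the string representation of class $\{a\}$.]
Constructing the FSA for $R_a$, then the FSA for the string representation of class $\{a\}$:
For each edge $(q, b, q')$ in $A_a$, add to $A_{\{a\}}$: $(q, b, q_0)$ and $(f, \bar{b}, q')$, where $q_0$ is the initial state and $f$ any final state of $A_b$.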
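Example (cont.)
DTD $d$: $r \to aa$, $a \to a?$.
[Figure: the FSA $A_r$ for $R_r$, with initial state $q_0^r$ and final state $f^r$, and the FSA $A_{\{r\}}$ in which each $a$-transition of $A_r$ leads through a copy of $A_{\{a\}}$.]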
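Example (cont.)
DTD $d$: $r \to aa$, $a \to a?$.
The above FSA $A_d$ accepts all well-balanced words produced by the above DTD.
But it also accepts other well-balanced words (such as $r\,a\,a\,\bar{a}\,a\,\bar{a}\,\bar{a}\,\bar{r}$).
There is no automaton recognizing this DTD.
[Figure: the FSA $A_d$, with start state $s$, final state $g$, and the $r$ and $\bar{r}$ transitions added around the automaton of the previous slide.]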
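To see how the standard FSA can over-accept, the sketch below assumes (from a reading of the construction, not from anything stated on the slide) that for $r \to aa$, $a \to a?$ the automaton accepts the regular language $r\, a^+\, \bar{a}^+\, a^+\, \bar{a}^+\, \bar{r}$. Closing tags are written in uppercase and all function names are illustrative.

```python
import re

# Assumed language of the standard FSA for r -> aa, a -> a?.
STANDARD_FSA = re.compile(r"ra+A+a+A+R")


def well_balanced(doc: str) -> bool:
    """True if doc is the string representation of a single tree."""
    stack = []
    for i, ch in enumerate(doc):
        if ch.islower():
            stack.append(ch)                       # opening tag
        elif not stack or stack.pop() != ch.lower():
            return False                           # mismatched closing tag
        if not stack and i != len(doc) - 1:
            return False                           # root closed too early
    return not stack and bool(doc)


def in_language(doc: str) -> bool:
    """L(d) = { r a^n A^n a^m A^m R | n, m >= 1 }: two chains of a nodes under r."""
    m = re.fullmatch(r"r(a+)(A+)(a+)(A+)R", doc)
    return bool(m) and len(m[1]) == len(m[2]) and len(m[3]) == len(m[4])


word = "raaAaAAR"   # r has ONE child a, which itself has two a children
print(STANDARD_FSA.fullmatch(word) is not None, well_balanced(word), in_language(word))
```

Under this assumption the word is well balanced and accepted, yet it is not in $L(d)$: `r` has a single child and that child has two children.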
Recognizable Fully Recursive DTDs
Theorem 4.1: The following are equivalent for each fully recursive DTD $d$:
(i) $d$ is recognizable.
(ii) $d$ satisfies the conditions of Lemma 4.2.
(iii) The set of well-balanced strings accepted by the FSA $A_d$ is precisely $L(d)$.
Recognizable DTDs
Which DTDs are recognizable?
Non-recursive DTDs.
Fully recursive DTDs satisfying the conditions of Lemma 4.2.
And others...
But characterization in the general case remains an open question.
Partial progress: necessary conditions for recognizability.
Alternative Validation Approaches
2 alternative approaches for validating DTDs that are not recognizable:
Relaxing the constant memory requirement.
Refining the original DTD.
Validation with Bounded Stack
Relaxing the constant memory requirement.
Use a stack whose depth is bounded by the depth of the XML document.
Validation is done in a single deterministic pass.
Appealing approach in practice.
For each DTD, there exists a deterministic PDA that accepts precisely its language.
Example: the DTD $r \to aa$, $a \to a?$.
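A stack-based validator for the example DTD can be sketched as follows. Each stack entry holds an open label and the current state of a DFA for that label's content model, so the stack never grows beyond the depth of the document. The DFAs in `RULES` are written by hand for $r \to aa$, $a \to a?$; the table layout, the `"/a"` spelling of closing tags, and the name `validate` are assumptions of this sketch.

```python
from typing import Dict, Iterable, List, Set, Tuple

# One DFA per rule: (start state, accepting states, transitions).
DFA = Tuple[int, Set[int], Dict[Tuple[int, str], int]]
RULES: Dict[str, DFA] = {
    "r": (0, {2}, {(0, "a"): 1, (1, "a"): 2}),   # R_r = aa
    "a": (0, {0, 1}, {(0, "a"): 1}),             # R_a = a?
}


def validate(tags: Iterable[str], root: str = "r") -> bool:
    """Single deterministic pass; checks well-formedness and the DTD together."""
    stack: List[Tuple[str, int]] = []   # (open label, DFA state of its content model)
    seen_root = False
    for tag in tags:
        if tag.startswith("/"):
            if not stack:
                return False
            label, state = stack.pop()
            # The closing tag must match, and the children seen so far must form
            # a complete word of the content model.
            if label != tag[1:] or state not in RULES[label][1]:
                return False
        else:
            if tag not in RULES:
                return False
            if stack:
                # Advance the parent's DFA on this child label.
                label, state = stack[-1]
                nxt = RULES[label][2].get((state, tag))
                if nxt is None:
                    return False
                stack[-1] = (label, nxt)
            elif seen_root or tag != root:
                return False
            seen_root = True
            stack.append((tag, RULES[tag][0]))
    return seen_root and not stack


print(validate("r a a /a /a a /a /r".split()))
print(validate("r a a /a a /a /a /r".split()))
```

The second input is the kind of document a plain FSA cannot rule out for this DTD: the inner `a` has two children, which the stack entry for that `a` detects immediately.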
Refining the DTD
Refining a DTD means providing in the tags additional information that can be used for validation.
Example:
DTD: $r \to aa$, $a \to a?$.
Refined DTD: $r \to a_1 a_2$, $a_1 \to a_1?$, $a_2 \to a_2?$.
The refined DTD can be validated by an FSA.
For every DTD, there exists such an equivalent refined DTD of quadratic size, which is recognizable.
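For the refined DTD of the example, an FSA only has to check the shape of the tag sequence, because the subscripts tell it which of the two chains it is in. The sketch below matches a space-separated tag string against one regular expression. The tag spellings `a1`, `a2`, `/a1`, `/a2` and the name `validate_refined` are assumptions, and the check is meant for inputs already known to be well-formed.

```python
import re

# r a1+ /a1+ a2+ /a2+ /r over space-separated tags. On well-balanced input the
# numbers of a1 and /a1 tags (and of a2 and /a2 tags) are forced to agree.
REFINED = re.compile(r"r (a1 )+(/a1 )+(a2 )+(/a2 )+/r")


def validate_refined(tags: str) -> bool:
    """Validate a well-formed document against r -> a1 a2, a1 -> a1?, a2 -> a2?."""
    return REFINED.fullmatch(tags) is not None


print(validate_refined("r a1 a1 /a1 /a1 a2 /a2 /r"))
print(validate_refined("r a1 /a1 a2 a2 /a2 a2 /a2 /a2 /r"))
```

The second document is well-formed but gives one `a2` node two children; the extra information in the tags lets the regular expression reject it.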
Summary
First step towards the formal investigation of processing streaming XML.
Provided conditions under which validation can be done in a single pass and constant memory, using an FSA.
Considered alternative approaches, when validation using an FSA is not possible.
Appendix
The Standard FSA Construction
The Standard FSA $A_d$ is inductively constructed starting from the maximal elements of the partial order $\le$ on classes.
Let $C$ be a maximal element of $\le$.
For each regular expression $R_c$ ($c \in C$), a non-deterministic FSA $A_c$ is built.
Disjoint states for different $c$'s.
The initial state of $A_c$ is $q_0^c$, while its final states are $f_1^c, f_2^c, \ldots$
The Standard FSA (cont.)
Build $A_C$, where $C$ is a maximal element of $\le$:
Its states are the union of the states of the FSAs $A_c$ for $c \in C$.
Transitions: for each transition $(q, b, q')$ of $A_c$ ($b$ must belong to $C$), add to $A_C$ the transitions:
$(q, b, q_0)$ for the initial state $q_0$ of $A_b$.
$(f, \bar{b}, q')$ for each final state $f$ of $A_b$.
The Standard FSA (cont.)
Build $A_C$ for non-maximal elements $C$ of $\le$, when all FSAs $A_E$ of elements $E$ such that $C \le E$ are already constructed:
Unlike the maximal elements case, $A_c$ has transitions $(q, e, q')$ where $e \in E$ (i.e., $e \notin C$).
For such transitions, we add to $A_C$:
A new disjoint copy of $A_E$.
$(q, e, q_0)$ for the initial state $q_0$ of $A_e$.
$(f, \bar{e}, q')$ for each final state $f$ of $A_e$.
The Standard FSA (cont.)
The final FSA $A_d$ is obtained by adding to the FSA $A_C$ of the minimum class $C$ (containing the root label $r$):
A new start state $s$ with transition $(s, r, q_0)$ for the start state $q_0$ of $A_r$.
A final state $g$ with transition $(f, \bar{r}, g)$ for each final state $f$ of $A_r$.
The Standard FSA - Lemma
Lemma 4.3: For each DTD $d$, let $A_d$ be the automaton described. We have:
(i) Every word in $L(d)$ is accepted by $A_d$.
(ii) $A_d$ can be constructed from $d$ in exponential time.
Complexity of $A_d$'s construction: $O(|d|^{\|d\|})$.
$|d|$ is the maximum size of an FSA for a regular expression of $d$.
$\|d\|$ is the depth of the partial order $\le$.