Japan, June 2003

The correction of XML data

Université Paris II & LRI

Michel de Rougemont

mdr@lri.fr

http://www.lri.fr/~mdr

1. Approximation and Edit Distance
2. Testers and Correctors
3. Correcting regular binary trees
4. Applications to XML. Practical corrector
5. Relative value of documents

Approximation

1. Relations: $\mathrm{Dist}(R,S) = \#\{x : R(x) \neq S(x)\}$. $R$ and $S$ are $\varepsilon$-close if $\mathrm{Dist}(R,S) < \varepsilon \cdot n^{\mathrm{arity}(R)}$.

2. Edit-distance

3. Trees: Tree-Edit-Distance, the minimum number of deletions and insertions.

[Figure: a left-deletion and a left-insertion on a tree]
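
As a small illustration of the relation distance above, the following Python sketch represents a relation as a set of tuples, so that the tuples on which R and S disagree are exactly their symmetric difference. The function names `dist` and `eps_close`, and the explicit parameters `n` (domain size) and `arity`, are assumptions made for this example.

```python
def dist(R, S):
    """Number of tuples x with R(x) != S(x), for relations given as sets of tuples."""
    return len(set(R) ^ set(S))


def eps_close(R, S, n, arity, eps):
    """True if Dist(R, S) < eps * n**arity, with n the size of the domain."""
    return dist(R, S) < eps * n ** arity


R = {(0, 1), (1, 2)}
S = {(0, 1), (2, 2)}
print(dist(R, S))                                # the two relations disagree on 2 tuples
print(eps_close(R, S, n=10, arity=2, eps=0.05))  # 2 < 0.05 * 100
```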

Binary trees: p-Distance allows permutation

Classical Tree-Edit-Distance

[Figure: three binary trees over the labels a, b, c, d, e, f, related by one Deletion and one Insertion]

Dist(T1,T2) = 2, p-Dist(T1,T2) = 1

$\mathrm{Dist}(T,L) = \min_{T' \in L} \mathrm{Dist}(T,T')$

Approximate satisfiability

1. Satisfiability: Tree |= F

2. Approximate satisfiability: Tree $\models_\varepsilon$ F

Image on a class K of trees:

[Figure: inside K, the region F and the region $\varepsilon$-far from F]

Logic, testers, correctors

A Tester decides $\models_\varepsilon$ for a formula F.

A Corrector takes a tree T close to a language L and finds T’ in L close to T.

This is possible if F follows a simple logic.

Theorem. There is a linear-time corrector for regular binary trees and a constant distance.

Given a tree T, k-close to a regular language L, we find in linear time T’ in L, c.k-close to T.

General problem: given a language L defined in some Logic, find a corrector.

Theorem (implicit in Alon et al., FOCS2000). There is a linear-time corrector for regular words and distance $\varepsilon \cdot n$.

Application to Model-Checking (LICS2002)

Simple example

Tester for 0+ 1* 0+

000000011111110000010000: probably accepted

011110000000110111: rejected with high probability

Types of segments:

[Figure: segment types marked on the word 0000011111000111110001100]

Corrector for 0+ 1* 0+: 00000001111110000100000 (the figure marks the corrected position with *)
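
To make the word corrector concrete, here is a minimal Python sketch. It is not the algorithm of the slides: it is a plain dynamic program over a small NFA for 0+ 1* 0+, assuming the Hamming distance (substitutions only). The state numbering and the names `TRANSITIONS` and `correct` are choices made for this example. For a fixed automaton the running time is linear in the length of the word.

```python
import re

# NFA for 0+ 1* 0+.
# State 0: start, 1: first block of zeros, 2: block of ones, 3: last block of zeros.
TRANSITIONS = [
    (0, "0", 1),
    (1, "0", 1), (1, "0", 3), (1, "1", 2),
    (2, "1", 2), (2, "0", 3),
    (3, "0", 3),
]
START, ACCEPT, N_STATES = 0, 3, 4
INF = float("inf")


def correct(word):
    """Return (distance, closest word in 0+ 1* 0+), or None if the length is too short."""
    cost = [INF] * N_STATES          # cost[q]: fewest substitutions to reach q
    cost[START] = 0
    back = []                        # back[i][q]: (previous state, letter written)
    for ch in word:
        new = [INF] * N_STATES
        choice = [None] * N_STATES
        for p, a, q in TRANSITIONS:
            c = cost[p] + (a != ch)  # pay 1 when the letter has to be changed
            if c < new[q]:
                new[q] = c
                choice[q] = (p, a)
        back.append(choice)
        cost = new
    if cost[ACCEPT] == INF:
        return None
    letters, state = [], ACCEPT
    for choice in reversed(back):    # walk the stored choices backwards
        state, a = choice[state]
        letters.append(a)
    return cost[ACCEPT], "".join(reversed(letters))


distance, fixed = correct("00000001111110000100000")
print(distance, fixed, bool(re.fullmatch("0+1*0+", fixed)))
```

On the slide's input the isolated 1 inside the last block of zeros is the only letter that needs to change.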

Regular Trees

• Tree-automata
• Logical definability on trees
• Tree grammar
• Regular expression

r(a,b(a,b(a,b(a,b(a,b(a,b)....)

r(a(a,b(a,b(a(a,b),b)....),b)

Tree automata

$A = (Q, q_0, \delta, q_1)$

• (q0,q0) → q1
• (q0,q1) → q1
• (q1,q1) → q2
• (q1,q0) → q2
• (q2,-) → q2
• (-,q2) → q2

[Figure: runs of the automaton on two trees, with nodes labelled by the states q0, q1 and q2]
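
The transition table above can be run bottom-up on a binary tree. The Python sketch below does so under assumptions that the slide does not state explicitly: leaves are in state q0, q1 is the accepting state, q2 is the sink, and node labels are ignored. A tree is written as `None` for a leaf or a pair `(left, right)` for an internal node; the names `DELTA`, `run` and `accepts` are chosen for this example.

```python
# Transitions that avoid the sink; every other pair of states goes to q2.
DELTA = {
    ("q0", "q0"): "q1",
    ("q0", "q1"): "q1",
}
SINK = "q2"


def run(tree):
    """State reached at the root of tree, computed bottom-up."""
    if tree is None:                 # a leaf
        return "q0"
    left, right = tree
    return DELTA.get((run(left), run(right)), SINK)


def accepts(tree):
    return run(tree) == "q1"


right_comb = (None, (None, (None, None)))
left_comb = ((None, None), None)
print(run(right_comb), run(left_comb))
```

With these transitions a right comb is accepted, while a left comb produces the pair (q1,q0) and falls into the sink.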

Feasible and infeasible subtrees

Definition: a subtree t is feasible for L if there are subtrees (for its leaves) which reach states (q1...ql) such that the state of the root q=t(q1...ql) can reach an accepting state (in the automaton for L).

A subtree is infeasible if it is not feasible.

[Figure: a feasible subtree and an infeasible subtree]

Infeasible subtrees

Fact. If $\mathrm{Distance}(T,L) > \varepsilon \cdot n$, then the number of infeasible subtrees of length a is O(n).

Fact. If the distance is small, there are few infeasible subtrees.

Intuition: make local corrections at the root of the infeasible subtrees.

Structure of the corrector

Phase 1: (Bottom-up) Marking of * nodes, roots of infeasible subtrees.

Phase 2: (Top-down) Recursive analysis of the * subtrees to make the root accept.

Phase 3: Local corrections.

[Figure: a tree with nodes in states q0 and q1]

Phase 1: bottom-up marking

Definitions:
1. A terminal *-node is the first sink node of a run.
2. A *-subtree of a node v is the subtree whose root is v, reaching leaves or *-nodes.
3. A node v is a *-node if its state is a sink node when all possible reachable states replace the *-nodes of its *-subtree.
4. Compute the size of the subtrees.

[Figure: a tree with *-nodes] Runs with all possible reachable states (q,q’) reach a sink.

O(n) procedure.

Lemma 1: If Dist(T,L) < k, there are at most k *-nodes.
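
The marking phase can be sketched in Python as follows. This is a simplified reading of the definitions, not the procedure of the talk: each node carries the set of non-sink states it may reach, a node is marked * when every combination of its children's possible states leads to the sink, and a marked node is then treated as a wildcard that may take any non-sink state. The automaton is the small one from the tree automata slide (leaves in q0, sink q2); the names `mark`, `DELTA` and `NON_SINK`, and the use of "L"/"R" paths to identify nodes, are assumptions of this example.

```python
NON_SINK = {"q0", "q1"}
# Transitions that avoid the sink q2; all other pairs of states lead to q2.
DELTA = {("q0", "q0"): "q1", ("q0", "q1"): "q1"}


def mark(tree, path="", stars=None):
    """Return (possible non-sink states at this node, paths of the *-nodes found)."""
    if stars is None:
        stars = []
    if tree is None:                       # a leaf is always in state q0
        return {"q0"}, stars
    left, right = tree
    left_states, _ = mark(left, path + "L", stars)
    right_states, _ = mark(right, path + "R", stars)
    states = {DELTA[(a, b)]
              for a in left_states for b in right_states
              if (a, b) in DELTA}
    if not states:                         # every run reaches the sink: a *-node
        stars.append(path)
        return set(NON_SINK), stars        # wildcard for the ancestors
    return states, stars


tree = (None, ((None, None), None))
print(mark(tree))
```

Each node is visited once and the state sets have constant size, so the sketch runs in O(n) for a fixed automaton, in line with the slide.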

Phase 2: top-down possible states

Let (q,q’) be a possible choice at the top *-subtree.

Let q’’ be a possible state for the *-node of the left *-subtree.

[Figure: a *-node whose children are in states q1 and q2; q’’ is used instead of *]

Correction needed.

Case analysis of the automaton

Case 1: One essentially-connected component.

Case 2: General case. Many components.

Case 1: one component

Lemma: if (q1,q2,q’’) are in the same connected component C, there is a finite subtree t which can correct.

Case a: there is a transition (q,q’) to q’’ with both q, q’ in C. There is a finite tree t1 from q1 to q and a finite tree t2 from q2 to q’, and the correction is:

[Figure: the node with children q1 and q2 and target state q’’ is replaced by a node in state q’’ whose children reach q and q’ through the inserted trees t1 (above q1) and t2 (above q2)]

Case 1: b and c

Case b: there is a transition (q,q’) to q’’ with one of q or q’ being q0. Suppose q=q0. The correction uses t2 and cuts the left branch.

[Figure: the node with children q1 and q2 becomes a node in state q’’ with children q0 and q’, where q’ is reached from q2 through t2]

Case c: there is a transition (q0,q0) to q’’. The correction cuts both branches.

[Figure: the node with children q1 and q2 becomes a node in state q’’ with children q0 and q0]

Correction rules

[Figure: a node with children in states q1 and q2; q’’ instead of *]

q1 | q2 | q      | q’      | q’’ | Action
   |    | q in C | q’ in C | q’’ | Insert, Insert
   |    | q0     | q’      | q’’ | Cut, Insert

Case 2: many components

Hypothesis: q1 in Ci, q2 in Cj, q’’ in Ck

Case a: P such that Ci < Ck and Cj < Ck

Find t1 and t2 as in case 1.a

[Figure: the node with children q1 and q2 and target state q’’ is replaced by a node in state q’’ whose children reach q and q’ through the inserted trees t1 and t2]

Case 2: b and c

Case b, c: P such that Ci > Ck and Cj < Ck. Find t2 and let Cp=inf(Ci,Ck). Cut the left branch until Cp.

Case d: P such that Ci > Ck and Cj > Ck. Let Cp=inf(Ci,Ck). Cut the left branch until Cp. Let Cq=inf(Cj,Ck). Cut the right branch until Cq.

[Figure: the corrections for these cases, with the left branch cut and t2 inserted above q2, and with both branches cut]

Correction rules

[Figure: a node with children in states q1 and q2; q’’ instead of *]

q1 in C1 | q2 in C2 | q in C | q’ in C’ | q’’ in C’’ | Action
C1 < C’’ | C2 < C’’ | C1 < C | C2 < C’  | q’’        | Insert, Insert
…        | …        | …      | …        | …          | …

Analysis of the corrector

Fact 1: finitely many insertions

Fact 2: deletions less predictable

Lemma: If the cut is large, then the distance must be large.

General Corrector:

1. Do the inductive marking bottom-up.

2. Apply the recursive analysis of compatible states top-down.

3. For each transition (q,q’) -> q’’, apply the correction, compute the distance and select the rule with the smallest distance.

4. Select the * states with minimum Dist.

The procedure is O(n), exponential in k and size(Q).

General result

Theorem: If Dist(T,L) < k, the general corrector finds T’ such that Dist(T,T’) < c.k.

Proof:

# *-nodes < k

Case 1: 0 *-nodes: no correction.

Case 2: at least 1 *-node. Looking at all possible k-variations will correct the errors in the *-subtree and diminish the number of *-nodes.

Unranked trees: XML

Labelled trees of large degree. Structure given by a « grammar », or DTD.

Generalization of automata:
1. Unranked tree automaton
2. Tree-walking automaton

Method: code an unranked labelled tree with a binary labelled tree.

Advantage: the correction table is FINITE.

Theorem: If Dist(T,L) < k, the general corrector finds T’ such that Dist(T,T’) < c.k.
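
The slides do not say which binary coding is used. One standard choice, shown here only as an assumption, is the first-child / next-sibling encoding: the left child of a binary node is the first child of the XML element and the right child is its next sibling. The sketch uses Python's standard `xml.etree.ElementTree` parser; the names `encode` and `encode_document` are chosen for this example.

```python
import xml.etree.ElementTree as ET


def encode(siblings):
    """Encode a list of sibling elements as a binary tree (label, left, right).

    left  = encoding of the children of the first element,
    right = encoding of the remaining siblings; None is the empty tree.
    """
    if not siblings:
        return None
    head, rest = siblings[0], siblings[1:]
    return (head.tag, encode(list(head)), encode(rest))


def encode_document(xml_text):
    return encode([ET.fromstring(xml_text)])


doc = "<book><chapter><title/><para/></chapter><title/><author/></book>"
print(encode_document(doc))
```

Whatever the degree of the XML nodes, every node of the encoded tree has at most two children, which is what lets a finite correction table apply.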

Applications to XML

DTD

<?xml version='1.0' ?>
<!ELEMENT book (chapter*,title,author)>
<!ELEMENT chapter (title,para*)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT para (#PCDATA)>
<!ELEMENT author (#PCDATA)>

Binary Normal Form

l -> l1, a
l1 -> c1, t
c1 -> c, c1
c1 -> -
c -> t, p1
p1 -> p, p1
p1 -> -
a -> data
t -> data
p -> data
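
To see where a document departs from this DTD, the content models can be checked node by node. The Python sketch below is an illustration only: the standard library does not validate DTDs, so each content model of the DTD above is transcribed by hand into a regular expression over the sequence of child tags. The names `CONTENT_MODEL` and `violations` are assumptions of this example.

```python
import re
import xml.etree.ElementTree as ET

# Hand transcription of the DTD: each child tag is followed by one space.
CONTENT_MODEL = {
    "book": r"(chapter )*title author ",
    "chapter": r"title (para )*",
    "title": r"",      # #PCDATA: no child elements
    "para": r"",
    "author": r"",
}


def violations(elem, path=""):
    """Paths of the elements whose children do not match their content model."""
    here = f"{path}/{elem.tag}"
    children = "".join(child.tag + " " for child in elem)
    model = CONTENT_MODEL.get(elem.tag)
    bad = []
    if model is None or re.fullmatch(model, children) is None:
        bad.append(here)
    for child in elem:
        bad.extend(violations(child, here))
    return bad


good = ET.fromstring("<book><chapter><title/><para/></chapter><title/><author/></book>")
bad = ET.fromstring("<book><title/><chapter><title/></chapter><author/></book>")
print(violations(good), violations(bad))
```

The elements reported here play the role of the places where a correction is needed; computing the correction itself is the job of the corrector described earlier.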


XML tree decomposition

XML file transformed into a binary labelled tree.


XML file with errors


Corrected XML file

No ambiguities on the possible states of q’’

Immediate correction!

XML Correction rules

[Figure: a node with children in states q1 and q2; q’’ instead of *]

q1 | q2 | q | q’ | q’’ | Action
-  | p1 | t | p1 | c   | Insert, Link
…  | …  | - | -  | -   | Delete, Delete

Java Implementation

Parser: Xerces. Tree structure: DOM.

DTD: d -> (c,b,a) or (f,b,a); c -> (a,b,b); f -> (a,b,b,a)

[Figure: a tree with root d and children *, b, a; the *-node has children a, b, c; an extra leaf a is shown for the second proposal]

Phase 1: look at the parent node of the *-node. Propose tags for * (c or f).

Phase 2: for each proposal, compute the distance.

*=c, distance=1, replacing c with b.

*=f, distance=2, replacing c with b and adding an a leaf.

Choose the 1st solution.
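
The two phases of this example can be replayed in a few lines. The implementation described on the slide is in Java with Xerces and DOM; the sketch below is an independent Python illustration, and it assumes that the distance between the children of the *-node and a content model is the ordinary edit distance on sequences of tags (insertion, deletion and replacement each cost 1). The names `RULES`, `edit_distance` and `propose` are chosen for this example.

```python
# Content models of the two candidate tags, taken from the DTD on the slide.
RULES = {"c": ["a", "b", "b"], "f": ["a", "b", "b", "a"]}


def edit_distance(xs, ys):
    """Edit distance between two sequences, computed row by row."""
    prev = list(range(len(ys) + 1))
    for i, x in enumerate(xs, 1):
        cur = [i]
        for j, y in enumerate(ys, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # keep or replace
        prev = cur
    return prev[-1]


def propose(children, candidates):
    """Candidate tags for the *-node, sorted by the cost of their correction."""
    return sorted((edit_distance(children, RULES[tag]), tag) for tag in candidates)


print(propose(["a", "b", "c"], ["c", "f"]))
```

The first entry of the sorted list is the proposal to keep, which matches the slide: *=c at distance 1 is preferred to *=f at distance 2.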

Relative value of documents

• Given a DTD, mark the Web documents as follows:
  – Infinity if they are far
  – i if Dist(Document,DTD) = i

• Provides a relative valued landscape. Works for boolean combinations.

• Generalize to:
  – Min{ Dist(D,DTD’) : DTD’ [relation symbol lost in the transcript] DTD }

Distance on words and trees

• On words, how can one compute:
  – Dist(w,w’), a P-problem. Is it possible in less than O(n)? Yes, STOC 2003.
  – Dist(w,L) and Dist(L,L’)

• Given two trees, how can one compute:
  – Dist(T,T’): P on ordered trees and NP-complete on unordered trees
  – p-Dist(T,T’): NP-complete

Conclusion

• Testers and Correctors
  – Testers for approximate verification
  – Correctors

• Trees
  – Regular trees are testable
  – If T is at distance less than k, then we can correct it.

• Theoretical algorithms

• Practical algorithms

Testers, Correctors and formal verification

Two different views of logical verification:

1. Formal verification. How can we check if a program satisfies a specification? Logical proof: theorem proving, model checking.

2. Design a tester for the specification (closer to practice: Windows 95 to XP!) (Blum & Kannan)

3. Combine the two approaches to approximately verify a specification (LICS 2002, Sylvain’s thesis)

Testers

Self-testers and correctors for Linear Algebra: Blum & Kannan, 1985s

Testers for graph properties, k-colorability: Goldreich et al., 1995s

Σ2 graph properties have testers: Alon et al., 1999

Regular languages have testers: Alon et al., 2000s

Testers for Regular tree languages (Mdr and Magniez)

Corrector for regular trees!

[Figure: the region F and the regions $\varepsilon$-far from F]

Blum’s Checker and Tester

Checker for f (Blum, Kannan, ~1990)

[Figure: the checker C with oracle P, inputs x and y, and outputs Correct or Buggy]

A checker is a probabilistic program C with an oracle P such that for all x, k:

• if P = f, then C(x,k) = Correct

• if P(x) ≠ f(x), then $\Pr[C(x,k) = \text{Buggy}] > 1 - (1/2)^k$
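
A classical instance of such a checker, given here as an illustration and not taken from the slides, is Freivalds' test for matrix multiplication: to check that C = P(A, B) equals A·B, compare A(Br) with Cr for random 0/1 vectors r. A correct program always passes, and a wrong product is caught in each round with probability at least 1/2, so k rounds give an error of at most (1/2)^k. The names `checker`, `mat_vec` and `mat_mul` are chosen for this sketch.

```python
import random


def mat_vec(M, v):
    return [sum(a * b for a, b in zip(row, v)) for row in M]


def mat_mul(A, B):
    """Reference product, used here as an example of a correct program P."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def checker(P, A, B, k, rng=random):
    """Freivalds-style checker: 'Correct' or 'Buggy' for the output of P on (A, B)."""
    C = P(A, B)
    n = len(B[0])
    for _ in range(k):
        r = [rng.randint(0, 1) for _ in range(n)]
        if mat_vec(A, mat_vec(B, r)) != mat_vec(C, r):
            return "Buggy"           # a witness r was found: the product is wrong
    return "Correct"


A = [[1, 2], [3, 4]]
B = [[0, 1], [1, 0]]
print(checker(mat_mul, A, B, k=20))
```

Each round costs three matrix-vector products, which is cheaper than recomputing the product that is being checked.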

Self-testing

• Distance: $d(f,g) = |\{x : f(x) \neq g(x)\}| / |D|$

• A self-tester for f is a probabilistic program $T(P,\varepsilon)$ such that:
  – If $d(P,f) = 0$, then $T(P,\varepsilon)$ = Correct
  – If $d(P,f) > \varepsilon$, then $T(P,\varepsilon)$ = Buggy

• Corrector. Division(x,y): Majority { x.r / y.r : r random }
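
The division corrector can be sketched in Python as follows. The program P may be wrong on a small fraction of its inputs; since x/y = (x·r)/(y·r), calling P on randomly scaled inputs and taking the majority answer recovers the right value with high probability. The number of trials, the range of r and the names `distance` and `self_correct_division` are assumptions of this example, and exact fractions are used so that equal answers compare equal.

```python
import random
from collections import Counter
from fractions import Fraction


def distance(f, g, domain):
    """d(f, g) = |{x : f(x) != g(x)}| / |D| over a finite domain D."""
    return sum(f(x) != g(x) for x in domain) / len(domain)


def self_correct_division(P, x, y, trials=15, rng=random):
    """Majority vote over P(x*r, y*r) for random non-zero multipliers r."""
    votes = Counter()
    for _ in range(trials):
        r = rng.randint(1, 10**6)
        votes[P(x * r, y * r)] += 1
    return votes.most_common(1)[0][0]


def buggy_division(a, b):
    """A program that is wrong on exactly one input pair."""
    if (a, b) == (6, 3):
        return Fraction(0)
    return Fraction(a, b)


print(buggy_division(6, 3), self_correct_division(buggy_division, 6, 3))
```

The direct call hits the bug, while the corrected call almost never queries the faulty input and returns 2.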

Property testing on graphs

2-Colorability

K is the set of graphs G = (D, E).

[Figure: a graph G and a random subgraph H; G bipartite, H 2-colorable]

G bipartite ⇒ Prob [ H is bipartite ] = 1

G is $\varepsilon$-far from bipartite ⇒ Prob [ H is non-bipartite ] > 2/3

Property testing on graphs

3-Colorability

[Figure: a graph G and a random subgraph H]

G 3-colorable ⇒ Prob [ H is 3-colorable ] = 1

G is $\varepsilon$-far from 3-colorable ⇒ Prob [ H is non 3-colorable ] > 2/3

Generalization to k-colorability

Property testing and descriptive complexity

• Which graph (and matrix) properties have testers?
  – Alon et al., STOC 99: Sigma 2 testers

• Compression.

[Figure: "U satisfies g?" and "V satisfies g?", with U and V $\varepsilon$-equivalent]

Property testing on words

F: 0*1*

K is the set of words W, seen as structures over a domain D with a unary relation U.

[Figure: a word W and a random subword H]

W |= F ⇒ Prob [ H |= F’ ] = 1

W is $\varepsilon$-far from F ⇒ Prob [ H |= not F’ ] > 2/3

A testable regular property

How can we verify F: 0*1*?

distance(w,w’) = Hamming distance

[Figure: a word W and a random subword H = 000011110111 ....., tested against F’]

W |= F ⇒ Prob [ H |= F’ ] = 1

W is $\varepsilon$-far from F ⇒ Prob [ H |= not F’ ] > 2/3

If W is far from F, many occurrences of 10 appear in W. Repeating the test will detect it with high probability.

Regular properties are testable

Theorem. Regular languages are testable.

N. Alon, M. Krivelevich, I. Newman, M. Szegedy, FOCS 99.

General idea: if a word is far from a regular language, it contains many subwords which are infeasible and can be detected.

Theorem. Dyck languages are not testable.