![Page 1: Schemaless Change detection in XML Documents using Semantic Identifiers](https://reader034.vdocuments.net/reader034/viewer/2022052600/55863655d8b42a4a348b459a/html5/thumbnails/1.jpg)
Change Detection in XML Documents
using Semantic Identifiers
BY
KAILAASH BALACHANDRAN
Outline
Motivation
Introduction
The Approach
• Identifiers
• 2-step Algorithm
• Axioms
Semantic Change Detection
• Finding Identifiers
• Matching Nodes
Examples
Conclusion
Motivation (1)
Fig. 1. Version 1
<author>
  <name>Dan Brown</name>
  <book>
    <title>The Da Vinci Code</title>
    <publisher>Doubleday</publisher>
    <price> $35 </price>
  </book>
  <book>
    <title>Angels and Demons</title>
    <publisher>Pocket Star</publisher>
    <price> $56</price>
  </book>
</author>
Fig. 2. Version 2
<author>
  <name>Dan Brown</name>
  <book>
    <title>The Da Vinci Code</title>
    <publisher>Doubleday</publisher>
    <salesprice>$35</salesprice>
    <isbn>0385504209</isbn>
  </book>
  <book>
    <title>Angels &amp; Demons</title>
    <publisher>Pocket Star</publisher>
    <price>$56</price>
  </book>
</author>
Changes from Version 1 to Version 2: in the first book, <price> is renamed to <salesprice> and an <isbn> is added; in the second book, the title "Angels and Demons" becomes "Angels & Demons".
Motivation (2)
Fig. 1. Version 1
<author>
  <name>Dan Brown</name>
  <book>
    <title>The Da Vinci Code</title>
    <publisher>Doubleday</publisher>
    <price> $35 </price>
  </book>
  <book>
    <title>Angels and Demons</title>
    <publisher>Pocket Star</publisher>
    <price> $56</price>
  </book>
</author>
Fig. 3. Version 3
<publisher>Doubleday
  <book>
    <title>The Da Vinci Code</title>
    <author>
      <name>Dan Brown</name>
    </author>
    <price> $35</price>
  </book>
</publisher>
<publisher>Pocket Star
  <book>
    <title>Angels and Demons</title>
    <author>
      <name>Dan Brown</name>
    </author>
    <price> $56</price>
  </book>
</publisher>
Version 3 holds the same data in a different structure: books are grouped under <publisher> instead of <author>, and <author> becomes a child of <book>.
Motivation (3)
Disadvantages of the structural change detection approach:
• It is difficult to associate elements in different versions.
• It breaks down when the changes are significant.
• It affects incremental evaluation.
• The cost of changes to the data is high.
Introduction
What is semantics-based change detection?
A process of identifying changes between successive versions of a document based on its semantics, rather than on the structure of the document.
The Approach:
1. Find a semantic identifier for each node in the XML model.
2. Use these identifiers to associate nodes across multiple versions.
Identifiers
Type: the list of labels on the path from the root to an element, separated by '/'.
Identifier: a list of expressions that serves to distinguish elements of the same type.
Two nodes x and y are semantically the same if and only if their identifiers evaluate to the same result:
Eval(x, L) = Eval(y, L)
where
• x and y are the nodes,
• L = {E1, E2, …, En} is the list of expressions.
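The equality test above can be tried out on the Version 1 document from the motivation slides. The following is only a minimal sketch: the function names `eval_identifier` and `semantically_same` and the identifier `["title", "publisher"]` are assumptions made for illustration, and each expression is restricted to a child path that ElementTree's `findall()` understands.

```python
import xml.etree.ElementTree as ET

VERSION_1 = """
<author>
  <name>Dan Brown</name>
  <book>
    <title>The Da Vinci Code</title>
    <publisher>Doubleday</publisher>
    <price>$35</price>
  </book>
  <book>
    <title>Angels and Demons</title>
    <publisher>Pocket Star</publisher>
    <price>$56</price>
  </book>
</author>
"""

def eval_identifier(node, expressions):
    """Eval(x, L): evaluate every expression E_i of L with x as context node.

    The result holds, per expression, the sorted text values it selects.
    """
    result = []
    for expr in expressions:
        values = sorted((hit.text or "").strip() for hit in node.findall(expr))
        result.append(tuple(values))
    return tuple(result)

def semantically_same(x, y, expressions):
    """x and y are semantically the same iff Eval(x, L) == Eval(y, L)."""
    return eval_identifier(x, expressions) == eval_identifier(y, expressions)

root = ET.fromstring(VERSION_1)
first, second = root.findall("book")
L = ["title", "publisher"]  # hypothetical identifier for <book>, not from the slides

print(eval_identifier(first, L))
print(semantically_same(first, second, L))  # the two books differ
print(semantically_same(first, first, L))   # a node always equals itself
```

Both expressions here stay inside the book subtree, so this identifier is a local one; expressions that climb to the parent need a richer evaluator than `findall()` on a single element.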
Identifiers
Local Identifier: An identifier is local if it evaluates to descendants of the context node; otherwise it is non-local.
Version 1:
<author>
  <name>Dan Brown</name>
  <book>
    <title>The Da Vinci Code</title>
    <publisher>Doubleday</publisher>
    <price> $35 </price>
  </book>
  <book>
    <title>Angels and Demons</title>
    <publisher>Pocket Star</publisher>
    <price> $56</price>
  </book>
</author>
Version 3:
<publisher>Doubleday
  <book>
    <title>The Da Vinci Code</title>
    <author>
      <name>Dan Brown</name>
    </author>
    <price> $35</price>
  </book>
</publisher>
<publisher>Pocket Star
  <book>
    <title>Angels and Demons</title>
    <author>
      <name>Dan Brown</name>
    </author>
    <price> $56</price>
  </book>
</publisher>
With <book> as the context node, <name> is non-local in Version 1 (it is a sibling of <book>) and local in Version 3 (it is a descendant of <book>).
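The local versus non-local distinction comes down to a descendant test. The sketch below checks, for shortened copies of Version 1 and Version 3, whether the <name> nodes lie below the first <book>. The helper names are assumptions for illustration, and taking <book> as the context node is a choice made here, not something the slide states.

```python
import xml.etree.ElementTree as ET

# Shortened copies of the two versions (one book each).
V1 = ("<author><name>Dan Brown</name>"
      "<book><title>The Da Vinci Code</title></book></author>")
V3 = ("<publisher>Doubleday<book><title>The Da Vinci Code</title>"
      "<author><name>Dan Brown</name></author></book></publisher>")

def is_descendant(context, node):
    """True if node lies strictly below context in the tree."""
    return any(d is node for d in context.iter() if d is not context)

def is_local(context, selected_nodes):
    """Local: everything the identifier selects is a descendant of the context."""
    return all(is_descendant(context, n) for n in selected_nodes)

def name_is_local_for_book(xml_text):
    root = ET.fromstring(xml_text)
    book = next(root.iter("book"))
    names = list(root.iter("name"))  # nodes a name-based identifier would select
    return is_local(book, names)

print(name_is_local_for_book(V1))  # <name> sits beside <book>
print(name_is_local_for_book(V3))  # <name> sits below <book>
```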
Identifying nodes based on their semantics
The Algorithm
Phase 1:
• Works bottom-up.
• Identifies all local identifiers.
• Identifies semantically different nodes.
Phase 2:
• Runs recursively and identifies non-local identifiers.
• Finds all semantically distinct nodes.
• Any remaining node is a redundant copy of another node in the document.
Identifying nodes based on their semantics (Phase 1)
<publisher>Doubleday
  <book>
    <title>The Da Vinci Code</title>
    <author>
      <name>Dan Brown</name>
    </author>
  </book>
</publisher>
<publisher>Pocket Star
  <book>
    <title>Angels and Demons</title>
    <author>
      <name>Dan Brown</name>
    </author>
  </book>
</publisher>
Nodes such as the two <publisher> elements are structurally different, so they are semantically different.
Axiom 1: Nodes that are structurally different are semantically different.
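One way to make "structurally different" concrete is to compare a canonical form of each subtree. The sketch below is an illustration, not the presenter's algorithm. It assumes that sibling order does not matter, that text following a child element (tail text) can be ignored, and that the two publishers may be wrapped in a hypothetical <root> element so the document parses.

```python
import xml.etree.ElementTree as ET

DOC = """<root>
<publisher>Doubleday
  <book><title>The Da Vinci Code</title><author><name>Dan Brown</name></author></book>
</publisher>
<publisher>Pocket Star
  <book><title>Angels and Demons</title><author><name>Dan Brown</name></author></book>
</publisher>
</root>"""

def signature(node):
    """Canonical form of a subtree, built bottom-up:
    label, own text, and the sorted signatures of the children."""
    text = (node.text or "").strip()
    children = tuple(sorted(signature(child) for child in node))
    return (node.tag, text, children)

def structurally_different(x, y):
    """Axiom 1 applies to x and y exactly when their signatures differ."""
    return signature(x) != signature(y)

root = ET.fromstring(DOC)
pub1, pub2 = root.findall("publisher")
auth1, auth2 = root.findall("publisher/book/author")

print(structurally_different(pub1, pub2))    # different text and titles
print(structurally_different(auth1, auth2))  # identical subtrees
```

The two <author> subtrees come out as structurally identical, so Axiom 1 alone cannot tell them apart.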
Identifying nodes based on their semantics (Phase 1, continued)
The two <author> nodes in the document above are structurally identical, so Axiom 1 does not distinguish them. Are they semantically the same?
Identifying nodes based on their semantics (Phase 2)
<publisher>Doubleday
  <book>
    <title>The Da Vinci Code</title>
    <author>
      <name>Dan Brown</name>
    </author>
  </book>
</publisher>
<publisher>Pocket Star
  <book>
    <title>Angels and Demons</title>
    <author>
      <name>Dan Brown</name>
    </author>
  </book>
</publisher>
No, because they are in the context of two different books.
Axiom 2: Nodes that are structurally identical are semantically identical if and only if their respective parents are semantically identical or if they are both root nodes.
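Axiom 2 can be read as a top-down rule: a node's semantic identity is its own structure combined with the identity of its parent. The sketch below follows that reading. The function `semantic_keys`, the wrapping <root> element, and the duplicated <author> in the second book are all assumptions added for illustration; the duplicate is there only to show what a redundant copy looks like.

```python
import xml.etree.ElementTree as ET

DOC = """<root>
<publisher>Doubleday
  <book><title>The Da Vinci Code</title><author><name>Dan Brown</name></author></book>
</publisher>
<publisher>Pocket Star
  <book><title>Angels and Demons</title>
    <author><name>Dan Brown</name></author>
    <author><name>Dan Brown</name></author>
  </book>
</publisher>
</root>"""

def signature(node):
    """Structural form of a subtree (label, text, sorted child signatures)."""
    text = (node.text or "").strip()
    return (node.tag, text, tuple(sorted(signature(c) for c in node)))

def semantic_keys(root):
    """Assign each node the key (key of its parent, own signature).

    Two nodes share a key only if they are structurally identical and
    their parents share a key; the root's parent key is None.
    """
    keys = {}
    def visit(node, parent_key):
        key = (parent_key, signature(node))
        keys[node] = key
        for child in node:
            visit(child, key)
    visit(root, None)
    return keys

root = ET.fromstring(DOC)
keys = semantic_keys(root)
a1, a2, a3 = root.findall("publisher/book/author")

print(signature(a1) == signature(a2))  # structurally identical
print(keys[a1] == keys[a2])            # parents are different books
print(keys[a2] == keys[a3])            # same book: a redundant copy
```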
Semantic Change Detection
How to handle structural changes?
Assumption: Identifying information will remain nearby.
[Figure: two small trees over the nodes X, Y, Z and A, labeled Version 1 and Version 2]
Semantic Change Detection
Type Territory: The territory of a type T is the set of all text nodes that are descendants of the least common ancestor (lca) of all of the type T nodes.
Within the type territory is the territory controlled by individual nodes of that type.
Node Territory: The territory of a type T node p is the type territory of T excluding all text nodes that are descendants of other type T nodes.
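The two definitions can be computed directly on a small document. The sketch below uses the bib document with authors n1 and n2 from the identifier slides; all function names are assumptions for illustration. A text node is represented by the element that owns it, and tail text is ignored.

```python
import xml.etree.ElementTree as ET

VERSION_1 = """<bib>
  <author><name>n1</name>
    <book><title>t1</title><publisher>p1</publisher></book>
  </author>
  <author><name>n2</name>
    <book><title>t2</title><publisher>p2</publisher></book>
    <book><title>t1</title><publisher>p1</publisher></book>
  </author>
</bib>"""

def parent_map(root):
    return {child: parent for parent in root.iter() for child in parent}

def path_from_root(node, parents):
    """Nodes from the document root down to node."""
    chain = [node]
    while chain[-1] in parents:
        chain.append(parents[chain[-1]])
    return chain[::-1]

def node_type(node, parents):
    """Type: labels from the root to the element, separated by '/'."""
    return "/".join(n.tag for n in path_from_root(node, parents))

def text_holders(node):
    """Elements at or below node that own a non-empty text node."""
    return [e for e in node.iter() if (e.text or "").strip()]

def type_territory(root, type_path):
    """All nodes of the type, plus the text nodes under their lca."""
    parents = parent_map(root)
    members = [n for n in root.iter() if node_type(n, parents) == type_path]
    lca = None
    for level in zip(*(path_from_root(m, parents) for m in members)):
        if any(n is not level[0] for n in level):
            break
        lca = level[0]  # deepest node shared by every root-to-member path
    return members, text_holders(lca)

def node_territory(root, p):
    """Type territory minus the text nodes below the other nodes of p's type."""
    parents = parent_map(root)
    members, territory = type_territory(root, node_type(p, parents))
    excluded = {e for other in members if other is not p
                for e in text_holders(other)}
    return [e for e in territory if e not in excluded]

root = ET.fromstring(VERSION_1)
top_book = root.findall("author/book")[0]
_, whole = type_territory(root, "bib/author/book")
print(sorted(e.text.strip() for e in whole))
print(sorted(e.text.strip() for e in node_territory(root, top_book)))
```

For the top book, the node territory keeps both author names because no <name> is a descendant of a <book> in this version.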
Node and Type Territory
[Figure: tree from the document root down to lca(p) and the type p nodes p1, p2 and p3; the type territory of p is marked, with the node territories of p1 and p2 inside it]
Finding Identifiers
Version 1:
<bib>
  <author>
    <name>n1</name>
    <book>
      <title>t1</title>
      <publisher>p1</publisher>
    </book>
  </author>
  <author>
    <name>n2</name>
    <book>
      <title>t2</title>
      <publisher>p2</publisher>
    </book>
    <book>
      <title>t1</title>
      <publisher>p1</publisher>
    </book>
  </author>
</bib>
Version 2:
<bib>
  <pub> p1
    <book>
      <title>t1</title>
      <author>
        <name>n1</name>
      </author>
    </book>
  </pub>
  <pub> p2
    <book>
      <title>t2</title>
      <author>
        <name>n2</name>
      </author>
    </book>
  </pub>
</bib>
Identifiers
Identifier for <book> in Version 1:
Node    IDENTIFIER
book    (../author/name/text(), title/text())
Identifiers
Values of Identifiers for <book> in Version 1
book (top): Value of Identifier = n1, t1
book (middle): Value of Identifier = n2, t2
book (bottom): Value of Identifier = n2, t1
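These three values can be reproduced with a small path evaluator. The sketch below is a hedged illustration: `evaluate` is a hypothetical helper that handles only `..`, child labels and a final `text()`. The slides write the first expression as `../author/name/text()`; read as plain XPath from a <book> node, `..` already lands on the <author>, so the sketch uses `../name/text()`. That reading of the notation is an assumption.

```python
import xml.etree.ElementTree as ET

VERSION_1 = """<bib>
  <author><name>n1</name>
    <book><title>t1</title><publisher>p1</publisher></book>
  </author>
  <author><name>n2</name>
    <book><title>t2</title><publisher>p2</publisher></book>
    <book><title>t1</title><publisher>p1</publisher></book>
  </author>
</bib>"""

def evaluate(context, path, parents):
    """Evaluate a tiny XPath-like path relative to context.

    Supported steps: '..' (parent), a child label, and a final 'text()'.
    """
    current = [context]
    for step in path.split("/"):
        if step == "..":
            current = [parents[n] for n in current if n in parents]
        elif step == "text()":
            return [(n.text or "").strip() for n in current]
        else:
            current = [c for n in current for c in n if c.tag == step]
    return current

def identifier_value(node, expressions, parents):
    """Concatenate the values selected by each expression of the identifier."""
    return tuple(v for expr in expressions for v in evaluate(node, expr, parents))

root = ET.fromstring(VERSION_1)
parents = {child: parent for parent in root.iter() for child in parent}

# Assumed reading of the slide's identifier for <book> (see the note above).
BOOK_IDENTIFIER = ["../name/text()", "title/text()"]

values = [identifier_value(b, BOOK_IDENTIFIER, parents) for b in root.iter("book")]
print(values)
```

The first expression leaves the book subtree, so under the earlier definition this identifier is non-local.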
Identifiers
Values of Identifiers for <book> in Version 2
<bib>
  <pub> p1
    <book>
      <title>t1</title>
      <author>
        <name>n1</name>
      </author>
    </book>
  </pub>
  <pub> p2
    <book>
      <title>t2</title>
      <author>
        <name>n2</name>
      </author>
    </book>
  </pub>
</bib>
book under <pub> p1: Value of Identifier = p1, t1
book under <pub> p2: Value of Identifier = p2, t2
Identifiers
Values of Identifiers for <book> in both versions:
Version 1
Node             IDENTIFIER
book (top)       n1, t1
book (middle)    n2, t2
book (bottom)    n2, t1
Version 2
Node             IDENTIFIER
book 1 (top)     p1, t1
book 2 (bottom)  p2, t2
How to map both?
Matching
Consider nodes p and q that reside in different versions Vp and Vq.
Admits: q admits p if and only if the identifier values q1, q2, …, qn of q are found in the node territory of p.
Nodes p and q are matched if and only if p and q admit each other.
[Figure: node q in Vq and node p in Vp, each labeled with the values q1, q2, …, qn]
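The matching rule can be sketched end to end on the two bib versions as listed on the identifier slides. This is an illustration under stated assumptions, not the presenter's implementation: "admits" is read as "every identifier value of q occurs as a text value in the node territory of p", the identifiers are inferred from the values shown on the slides (author name plus title in Version 1, publisher text plus title in Version 2), and all function names are hypothetical.

```python
import xml.etree.ElementTree as ET

V1 = """<bib>
  <author><name>n1</name>
    <book><title>t1</title><publisher>p1</publisher></book>
  </author>
  <author><name>n2</name>
    <book><title>t2</title><publisher>p2</publisher></book>
    <book><title>t1</title><publisher>p1</publisher></book>
  </author>
</bib>"""

V2 = """<bib>
  <pub>p1
    <book><title>t1</title><author><name>n1</name></author></book>
  </pub>
  <pub>p2
    <book><title>t2</title><author><name>n2</name></author></book>
  </pub>
</bib>"""

def texts(node):
    """Text values of all text nodes at or below node (tail text ignored)."""
    return [e.text.strip() for e in node.iter() if (e.text or "").strip()]

def path_from_root(node, parents):
    chain = [node]
    while chain[-1] in parents:
        chain.append(parents[chain[-1]])
    return chain[::-1]

def node_territory(p, members, parents):
    """Texts under the lca of all members, minus texts under the other members."""
    lca = None
    for level in zip(*(path_from_root(m, parents) for m in members)):
        if any(n is not level[0] for n in level):
            break
        lca = level[0]
    territory = texts(lca)
    for other in members:
        if other is not p:
            for value in texts(other):
                territory.remove(value)  # drop one occurrence per excluded text node
    return set(territory)

def id_v1(book, parents):
    """Assumed Version 1 identifier: author name and title."""
    return {parents[book].findtext("name").strip(), book.findtext("title").strip()}

def id_v2(book, parents):
    """Assumed Version 2 identifier: publisher text and title."""
    return {parents[book].text.strip(), book.findtext("title").strip()}

def book_view(xml_text, identifier):
    """(identifier values, node territory) for every <book>, in document order."""
    root = ET.fromstring(xml_text)
    parents = {child: parent for parent in root.iter() for child in parent}
    books = list(root.iter("book"))  # all books share one type in both versions
    return [(identifier(b, parents), node_territory(b, books, parents)) for b in books]

def admits(q, p):
    """q admits p: every identifier value of q occurs in p's node territory."""
    return q[0] <= p[1]

def matches(view_p, view_q):
    """Pairs (index in Vp, index in Vq) of nodes that admit each other."""
    return [(i, j) for i, p in enumerate(view_p) for j, q in enumerate(view_q)
            if admits(q, p) and admits(p, q)]

result = matches(book_view(V1, id_v1), book_view(V2, id_v2))
print(result)
```

For the XML as written here, the top and middle books of Version 1 pair with the first and second books of Version 2, and the bottom book of Version 1 finds no partner because n2 is missing from the territory of the first Version 2 book.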
Semantic Change Detection
[Figure, shown over five slides: Version 1 and Version 2 of the bib document drawn as trees. The first four slides step through the book matches, marking the "admits" relationships and then the resulting node matches; the last slide shows the author matches.]
Conclusion
Semantic change detection technique:
• Find identifiers for each node in the XML document.
• Associate nodes across versions.
Information that identifies an element is conserved across changes.
Time complexity is O(n*log(n))
We can match nodes even when structural changes are significant.