Cooperative XML (CoXML) Query Answering

Motivation

XML has become the standard format for information representation and data exchange

An explosive increase in the amount of XML data available on the web, e.g.,
  Bills at the Library of Congress
  IEEE Computer Society’s publication
  SwissProt – protein sequence databases
  XMark – online auction data
  ...

Effective XML search methods are needed!

Challenges

XML schema is usually very complex

E.g., the schema for the IEEE Computer Society publication dataset contains about 170 distinct tags and more than 1000 distinct paths

It is often unrealistic for users to fully understand a schema before asking queries

Exact query answering is inadequate and approximate query answering is more appropriate!

Approach: CoXML

[Figure: a query is submitted to the Cooperative XML Query Answering layer, which sits on top of an XML database engine storing the XML documents and returns approximate answers]

Derive approximate answers by relaxing query conditions, i.e., query relaxation

Roadmap
  Introduction
  Background
  CoXML
  Related Work
  Conclusion

XML Data Model

XML data is often modeled as an ordered labeled tree
  Tree nodes: elements
  Tree edges: element-nesting relationships

Example data tree (element nodes are numbered; quoted text is content):
  article (1)
    title (2): "Search engine spam detection"
    author (3)
      name (4): "XYZ"
      title (5): "IEEE Fellow"
    year (6): "2003"
    body (7)
      section (8): "..a spam detection technique by content analysis…"

XML Query Model

XML queries are often modeled as trees
  Structure conditions: a set of query nodes connected by
    Parent-to-child (‘/’) edges: directly connected
    Ancestor-to-descendant (‘//’) edges: connected either directly or indirectly
  Content conditions: either value predicates or keyword constraints on query nodes

Example query tree:
  article
    title: "search engine"
    year: 2003
    section: "spam detection"

XML Query Answer

An answer for a query is a set of nodes in a data tree that satisfies both the structure and content conditions

Example

Data Tree
  article (1)
    title (2): "Search engine spam detection"
    author (3)
      name (4): "XYZ"
      title (5): "IEEE Fellow"
    year (6): "2003"
    body (7)
      section (8): "..a spam detection technique by content analysis…"

Query Tree
  article
    title: "search engine"
    year: 2003
    section: "spam detection"
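The matching of a query tree against a data tree can be illustrated with a minimal Python sketch. The class names, the `matches` helper and the substring test used for keyword conditions are assumptions made for illustration; the slides do not say how content conditions are evaluated, and the edge types chosen for the example query (a ‘//’ edge to section) are also assumed.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DataNode:
    label: str
    text: str = ""
    children: List["DataNode"] = field(default_factory=list)


@dataclass
class QueryNode:
    label: str
    keyword: str = ""   # content condition; empty string means none
    edge: str = "/"     # edge from the parent: "/" or "//"
    children: List["QueryNode"] = field(default_factory=list)


def descendants(node):
    """Yield all proper descendants of a data node."""
    for child in node.children:
        yield child
        yield from descendants(child)


def subtree_text(node):
    """Concatenate the text of a node and of everything below it."""
    return " ".join([node.text] + [subtree_text(c) for c in node.children])


def matches(q, d):
    """True if data node d satisfies query node q and all of q's sub-conditions."""
    if q.label != d.label:
        return False
    if q.keyword and q.keyword.lower() not in subtree_text(d).lower():
        return False
    for qc in q.children:
        candidates = d.children if qc.edge == "/" else descendants(d)
        if not any(matches(qc, c) for c in candidates):
            return False
    return True


data = DataNode("article", children=[
    DataNode("title", "Search engine spam detection"),
    DataNode("author", children=[DataNode("name", "XYZ"),
                                 DataNode("title", "IEEE Fellow")]),
    DataNode("year", "2003"),
    DataNode("body", children=[
        DataNode("section", "a spam detection technique by content analysis")]),
])

query = QueryNode("article", children=[
    QueryNode("title", "search engine"),
    QueryNode("year", "2003"),
    QueryNode("section", "spam detection", edge="//"),
])

# Same query, but requiring section to be a direct child of article.
strict_query = QueryNode("article", children=[
    QueryNode("title", "search engine"),
    QueryNode("year", "2003"),
    QueryNode("section", "spam detection", edge="/"),
])

print(matches(query, data), matches(strict_query, data))
```

With the ‘//’ edge the article matches, because section is reached through body; with a ‘/’ edge it does not, which is the kind of near miss that query relaxation is meant to recover.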

XML Query Relaxation Types

Value relaxation: enlarging a value condition’s search scope
  Example: in the query tree article[title: "search engine", year: 2003, section: "spam detection"], the condition year = 2003 is relaxed to year in 2000-2005
Node relabel: changing the label of a node to a similar or a more general label by domain knowledge
  Example: in the same query tree, the root node article is relabeled to document

[1] Tree Pattern Relaxation (S. Amer-Yahia, et al., 2000)

XML Query Relaxation Types

Edge generalization: relaxing a ‘/’ edge to a ‘//’ edge
  [Figure: the example query tree article[title: "search engine", year: 2003, section: "spam detection"] before and after an edge is generalized]
Node deletion: dropping a node from a query tree
  Example: after the title node is dropped from the same query tree, its keyword condition "search engine" attaches to article

XML Relaxation Properties

Definition
  Relaxation operation: an application of a relaxation type to a specific query node or edge
Lemma
  Given a query tree with n applicable relaxation operations, there are potentially up to $2^n$ relaxed trees
  Possible combinations: $\binom{n}{1} + \cdots + \binom{n}{n} = 2^n - 1$
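The count in the lemma can be checked by enumerating every non-empty subset of a set of relaxation operations. The following is a small sketch; the three operation names are hypothetical and only borrow the gen/del notation used in the XTAH slides.

```python
from itertools import combinations


def relaxation_combinations(operations):
    """Yield every non-empty subset of the given relaxation operations."""
    for size in range(1, len(operations) + 1):
        for combo in combinations(operations, size):
            yield frozenset(combo)


# Hypothetical operations on a template article($1)[title($2), body($3)/section($4)].
ops = ["gen(e$1,$2)", "gen(e$3,$4)", "del($2)"]
combos = list(relaxation_combinations(ops))
print(len(combos), 2 ** len(ops) - 1)
```

Some subsets may not yield a meaningful or distinct relaxed tree (for example, generalizing an edge to a node that is also deleted), which is why the lemma speaks of "potentially up to" that many trees.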

Challenges

Query relaxation is often user-specific
  Different users may have different approximate matching specifications for a given query tree
  How to provide user-specific approximate query answering?
A query with n relaxation operations has potentially up to $2^n$ relaxed queries
  How to systematically relax a query?
Query relaxation generates a set of approximate answers
  How to effectively rank the returned approximate answers?

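CoXML System Overview

[Figure: CoXML system architecture. Components: Relaxation Engine, Ranking Module and Relaxation Index Builder (which produces the XTAH relaxation indexes), on top of an XML Database Engine over the XML documents. Labeled flows: RLXQuery in and ranked results out; query and relaxed query sent to the database engine; exact answers and results returned; similarity metrics; relaxation language; relaxation indexes]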

Roadmap
  Introduction
  Background
  CoXML
    Relaxation Language
    Relaxation Indexes
    Ranking
    Evaluation
    Testbed
  Related Work
  Conclusion

Relaxation Language

Motivation
  Enabling users to specify approximate conditions in queries and to control the approximate matching process
RLXQuery - relaxation-enabled XQuery
  Extends the standard XML query language (XQuery) with relaxation constructs and controls, such as
    ~ : approximate conditions
    ! : non-relaxable conditions
    REJECT : unacceptable relaxations
    AT-LEAST : minimum number of answers to be returned
    RELAX-ORDER : relaxation order among multiple conditions
    USE : allowable relaxation types

RLXQuery Example

FOR $a in doc("bib.xml")//article
WHERE $a/year = ~2003 V-COND-LABEL t1 and
      ~($a[about(./!title, "search engine")]/body/section)[about(., "spam detection")] S-COND-LABEL t2
RETURN $a
RELAX-ORDER (t1, t2)
USE (edge generalization, node deletion)
AT-LEAST 20

[Figure: the corresponding query tree: article with children title ("search engine", marked ! as non-relaxable), year (2003, labeled t1) and body; body has child section ("spam detection"); t2 labels the structure condition]

Relaxation Index

Naïve approach
  Generate all possible relaxed queries and iteratively select the best relaxed query to derive approximate answers
  Exhaustive, but not scalable
Observation
  Many queries share the same (or similar) tree structures
Our approach: relaxation index
  Consider the structure of a query tree T as a template
  Build indexes on the relaxed trees of T
  Use the index to guide the relaxations of any query with the same (or similar) tree structure as that of T

Relaxation Index - XTAH

XTAH
  A hierarchical multi-level labeled cluster of relaxed trees
Building an XTAH
  Given a query structure template T, generate all possible relaxed trees
    Each relaxed tree uses a unique set of relaxation operations
  Cluster relaxed trees into groups based on relaxation operations and distances, similar to “suffix-tree” clustering
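The grouping idea can be illustrated with a deliberately simplified sketch: each relaxed tree is identified by its set of relaxation operations, and operation sets are grouped by the relaxation types they use. This is a stand-in chosen for illustration, not the suffix-tree-like clustering or the distance measure mentioned on the slide; `OPS` and `build_groups` are hypothetical names.

```python
from collections import defaultdict
from itertools import combinations

# (relaxation type, target) pairs; targets follow the gen/del notation of the slides.
OPS = [
    ("edge_generalization", "e$1,$2"),
    ("edge_generalization", "e$3,$4"),
    ("node_deletion", "$2"),
]


def build_groups(ops):
    """Group every non-empty operation set by the relaxation types it uses."""
    groups = defaultdict(list)
    for size in range(1, len(ops) + 1):
        for combo in combinations(ops, size):
            types = frozenset(op_type for op_type, _ in combo)
            groups[types].append(frozenset(combo))
    return groups


groups = build_groups(OPS)
for types, members in groups.items():
    print(sorted(types), len(members))
```

Because each group is labeled with the relaxation types its members use, a whole group can be skipped when a query disallows one of those types.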

XTAH Example

Template structure T: article ($1) with children title ($2) and body ($3); body has child section ($4)

[Figure: a sample XTAH for the template structure T. The root group relax has the child groups edge_generalization, node_relabel and node_deletion. Lower-level groups include {gen(e$1,$2)}, …, {gen(e$3, $4)}, {gen(e$1, $2), gen(e$3, $4)}, {gen(e$3, $4), gen(e$1,$3)}, {del($2)} and {del($2), del($3)}. The leaves are relaxed trees T1, T2, T3, T4, T6, T7, …; T6 consists of article, body and section, and T7 of article and section]

gen(e$u, $v) – relaxing the edge between $u and $v
del($u) – deleting the node $u

XTAH Properties

Each group consists of a set of relaxed trees obtained by using similar relaxation operations
  Efficient location of relaxed trees based on relaxation operations
The higher the level of a group, the less relaxed the trees in the group
  Relaxing queries at different granularities by traversing up and down the XTAH

XTAH-Guided Query Relaxation

Problem
  Given a query with relaxation specifications (constructs and controls), how to search an XTAH for relaxed queries that satisfy the specifications?
Approach
  First, prune XTAH groups containing trees that use unacceptable relaxations as specified in the query
    This step can be efficiently achieved by utilizing internal node labels
  Then, iteratively search the XTAH for the best relaxed query
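The two steps can be sketched over a flat list of candidate operation sets. This is a simplification made for illustration: a real XTAH is hierarchical and prunes whole groups via their labels, and the candidate list, the rejected operation and the cost values below are all hypothetical.

```python
GEN_12 = ("edge_generalization", "e$1,$2")
GEN_34 = ("edge_generalization", "e$3,$4")
DEL_2 = ("node_deletion", "$2")
RELABEL_1 = ("node_relabel", "$1")


def prune(candidates, allowed_types, rejected_ops):
    """Keep operation sets that use only allowed types and no rejected operation."""
    return [
        ops for ops in candidates
        if all(op[0] in allowed_types and op not in rejected_ops for op in ops)
    ]


def best_candidate(candidates, cost):
    """Return the operation set with the lowest total cost."""
    return min(candidates, key=lambda ops: sum(cost[op] for op in ops))


candidates = [
    frozenset({GEN_12}),
    frozenset({GEN_34}),
    frozenset({DEL_2}),
    frozenset({GEN_12, GEN_34}),
    frozenset({RELABEL_1}),
]
allowed = {"edge_generalization", "node_deletion"}   # as in USE (...)
rejected = {DEL_2}                                   # e.g. title is marked non-relaxable
cost = {GEN_12: 0.3, GEN_34: 0.2, DEL_2: 0.5, RELABEL_1: 0.4}  # assumed costs

kept = prune(candidates, allowed, rejected)
print(len(kept), sorted(best_candidate(kept, cost)))
```

Repeating the selection after removing the chosen candidate gives the iterative search: each round yields the next-cheapest relaxed query until enough answers (AT-LEAST) have been collected.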

Query Relaxation Process Example

Sample RLXQuery: the query tree article with children title ("search engine", marked !), year (2003, labeled t1) and body; body has child section ("spam detection"); t2 labels the structure condition
Relaxation Control: USE (edge generalization, node deletion) AT-LEAST 20

The template structure, T: article ($1) with children title ($2) and body ($3); body has child section ($4)

[Figure: a sample XTAH for the template structure T, with the root group relax, its child groups node_relabel, edge_generalization and node_deletion, lower-level groups such as {gen(e$1,$2)}, {gen(e$3, $4)}, {gen(e$1, $2), gen(e$3, $4)}, {gen(e$3, $4), gen(e$1,$3)}, {del($2)} and {del($2), del($3)}, and the relaxed trees T1, T2, T3, T4, T6 and T7 at the leaves]

XTAH-Guided Query Relaxation

Problem
  Given a query and an XTAH, how to efficiently locate the best relaxation candidate at the leaf level?
Approach: M-tree [2]
  Assign representatives to internal groups
  Representatives summarize distance properties of the trees within groups
  Use representatives to guide the search path to the best relaxation candidate

[Figure: a hierarchy of representatives R0; R1, R2, R3; R5, R8, R11, leading down to a relaxed tree j]

[2] M-tree: An efficient access method for similarity search in metric spaces (P. Ciaccia et al., VLDB 97)

Ranking

Ranking criteria
  Based on both content and structure similarities between a query and an answer, i.e., a set of data nodes
Approach
  Content similarity – extended vector space model
  Structure similarity – tree editing distance with a model for assigning operation cost
  Overall relevancy – a ranking model combining both content and structure similarities

Content Similarity

Traditional IR ranking: content similarity between a query and a document
  Term Frequency
  Inverse Document Frequency
  Vector Space Model
XML content ranking: content similarity between a query and an answer (i.e., a set of data nodes)
  Weighted Term Frequency
  Inverse Element Frequency
  Extended Vector Space Model

Weighted Term Frequency

Terms under different paths of a node are weighted differently

Example
  Query: section with the keyword condition "spam detection"
  Data: section (5)
    title (6): "Spam Detection By Content Analysis"
    paragraph (8): "…an approach to detect spam by …"
    reference (12): "Spam detection taxonomy"

The weighted term frequency for a term t in a node v is:

$$ \mathrm{tf}_w(v, t) = \sum_{i=1}^{m} w(p_i) \cdot \mathrm{tf}(p_i, t) $$

$p_i$: a path under the node v to a term t
m: # of different paths under the node v that contain the term t
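As a rough illustration of the weighted term frequency, the sketch below sums, over the paths under a node, the path weight times the term count on that path. The path weights and the lowercased term counts are assumptions made for the example; the slides do not give concrete weights.

```python
def weighted_tf(path_term_counts, path_weights, term):
    """tf_w(v, t): sum over paths p_i under v of w(p_i) * tf(p_i, t)."""
    total = 0.0
    for path, counts in path_term_counts.items():
        tf = counts.get(term, 0)
        if tf:  # only paths that actually contain the term contribute
            total += path_weights[path] * tf
    return total


# Term counts under the section node of the slide's example, per child path.
counts = {
    "title": {"spam": 1, "detection": 1, "content": 1, "analysis": 1},
    "paragraph": {"approach": 1, "detect": 1, "spam": 1},
    "reference": {"spam": 1, "detection": 1, "taxonomy": 1},
}
# Hypothetical weights: a hit in the title counts more than one in a reference.
weights = {"title": 1.0, "paragraph": 0.5, "reference": 0.25}

print(weighted_tf(counts, weights, "spam"))
print(weighted_tf(counts, weights, "detection"))
```

With these assumed weights, "spam" scores higher than "detection" because it also occurs in the paragraph, while a term that appears only in the reference would score lowest.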

Inverse Element Frequency

The more XML elements contain a term, the less disambiguating power the term has
  E.g., the term “spam” is less disambiguating than the term “detection”
The inverse element frequency for a query term t is

$$ \mathrm{ief}(\$u, t) = \log \frac{N_1}{N_2} $$

$u: a query node whose content condition contains the term t
$N_1$: # of data nodes that match the structure condition related to $u
$N_2$: # of data nodes that match the structure condition related to $u and contain t
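A short sketch of the inverse element frequency follows. The slides do not state the logarithm base, so the natural logarithm is assumed here, and the node counts in the example are made up.

```python
import math


def ief(n_structure_matches, n_structure_and_term_matches):
    """ief($u, t) = log(N1 / N2), using the natural logarithm (an assumption)."""
    if n_structure_and_term_matches <= 0:
        raise ValueError("N2 must be positive: the term must occur in some matching node")
    return math.log(n_structure_matches / n_structure_and_term_matches)


# Hypothetical counts: 1000 section nodes, of which 100 contain the first term
# and 10 contain the second, rarer term.
common = ief(1000, 100)
rare = ief(1000, 10)
print(round(common, 4), round(rare, 4))
```

The rarer term receives the larger value, so it contributes more to the content similarity, and a term that occurs in every structurally matching node contributes nothing.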

Extended Vector Space Model

The content similarity between an answer A and a query Q is

$$ \mathrm{cont\_sim}(A, Q) = \sum_{i=1}^{n} \sum_{j=1}^{|\$u_i.cont|} \mathrm{tf}_w(v_i, t_{ij}) \cdot \mathrm{ief}(\$u_i, t_{ij}) $$

n: # of nodes in Q
{$u_1, …, $u_n}: the set of query nodes in Q
{v_1, …, v_n}: the set of data nodes in A, where v_i matches $u_i (1 ≤ i ≤ n)
|$u_i.cont|: the number of terms in the content conditions on the node $u_i
t_ij: a term in the content condition on the query node $u_i
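The double sum can be written out directly once the weighted term frequencies and inverse element frequencies are available. In this sketch they are supplied as lookup tables with invented values; the node identifiers and query node names are hypothetical.

```python
def cont_sim(matches, tf_w, ief):
    """Sum tf_w(v_i, t_ij) * ief($u_i, t_ij) over all query nodes and their terms.

    matches: list of (data node v_i, query node $u_i, terms in $u_i's content condition)
    """
    return sum(
        tf_w[(v, t)] * ief[(u, t)]
        for v, u, terms in matches
        for t in terms
    )


# An answer pairing data node 2 with $title and data node 8 with $section.
matches = [
    (2, "$title", ["search", "engine"]),
    (8, "$section", ["spam", "detection"]),
]
tf_w = {(2, "search"): 1.0, (2, "engine"): 1.0,
        (8, "spam"): 0.5, (8, "detection"): 0.5}
ief = {("$title", "search"): 2.0, ("$title", "engine"): 1.0,
       ("$section", "spam"): 1.0, ("$section", "detection"): 2.0}

print(cont_sim(matches, tf_w, ief))
```

Query nodes without a content condition simply contribute an empty term list and therefore add nothing to the sum.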

Structure Distance Function

Both XML data and queries are modeled as trees
Similarities between trees are often computed by editing distances, i.e., the cost of the cheapest sequence of editing operations that transforms one tree into the other
The structure distance between an answer A and a query Q can be measured as the total cost of the relaxation operations used to derive A

$$ \mathrm{struct\_dist}(A, Q) = \sum_{i=1}^{k} \mathrm{cost}(r_i) $$

{r_1, …, r_k}: the set of relaxation operations used to derive A
cost(r_i): the cost for r_i (0 ≤ cost(r_i) ≤ 1)

Relaxation Operation Cost

Naïve approach
  Assign uniform cost to all relaxation operations
  Simple but ineffective
Our approach
  Assign an operation cost based on the similarity between the two nodes being approximated by the operation
  The closer the two nodes, the less the operation costs

$$ \mathrm{cost}(r_i) = 1 - \mathrm{similarity}(\$u, \$v) $$

r_i: a relaxation operation
$u, $v: the two nodes that are being approximated by r_i
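The cost model and the structure distance built on it can be sketched in a few lines. The similarity values below are invented for illustration; the slides do not say how node similarity is computed.

```python
def operation_cost(similarity):
    """cost(r_i) = 1 - similarity($u, $v), with similarity expected in [0, 1]."""
    if not 0.0 <= similarity <= 1.0:
        raise ValueError("similarity must lie in [0, 1]")
    return 1.0 - similarity


def struct_dist(similarities):
    """Total cost of the relaxation operations used to derive an answer."""
    return sum(operation_cost(s) for s in similarities)


# Hypothetical similarities: relabeling article to document (0.75)
# and deleting section under body (0.5).
print(struct_dist([0.75, 0.5]))
```

An exact answer uses no relaxation operations, so its structure distance is zero, and each additional operation adds at most 1.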

Nodes Approximated By Relaxation Operations

Relaxation operation   Nodes being approximated by the operation: ($u, $v)       Example
Node relabel           (a node with the old label, a node with the new label)    (article, document)
Node deletion          (a child node, the parent node)                           (section, body)
Edge generalization    (a child node, a descendant node)                         (article/title, article//title)

[Figure: the query tree T1 (article with children title and body; body has child section) and three relaxed trees: T2, node relabel (article becomes document); T3, node deletion (section is dropped); T4, edge generalization]

Overall Relevancy Function

[Figure: overall relevancy is derived from content similarity and structure distance]

The overall relevancy of an answer A to a query Q, sim(A, Q), is a function of cont_sim(A, Q) and struct_dist(A, Q)

Properties
  sim(A, Q) = cont_sim(A, Q) if struct_dist(A, Q) = 0
  sim(A, Q) increases as cont_sim(A, Q) increases
  sim(A, Q) decreases as struct_dist(A, Q) increases

Implementation

$$ \mathrm{sim}(A, Q) = \alpha^{\mathrm{struct\_dist}(A, Q)} \cdot \mathrm{cont\_sim}(A, Q) $$

α is a small constant between 0 and 1
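Reading the implementation formula as α raised to the structure distance, multiplied by the content similarity, the relevancy function can be sketched as follows. The default value of `alpha` is an arbitrary choice for the example.

```python
def sim(cont_sim, struct_dist, alpha=0.5):
    """sim(A, Q) = alpha ** struct_dist(A, Q) * cont_sim(A, Q), with 0 < alpha < 1."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    return (alpha ** struct_dist) * cont_sim


exact = sim(4.0, 0.0)      # no relaxation: relevancy equals content similarity
relaxed = sim(4.0, 1.0)    # one full-cost operation halves it when alpha = 0.5
print(exact, relaxed)
```

This form has the three listed properties: it equals the content similarity when the structure distance is zero, grows with the content similarity, and shrinks as the structure distance grows; a smaller α penalizes structural relaxation more heavily.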


Evaluation Studies

INEX (Initiative for the Evaluation of XML Retrieval)
  Similar to TREC for text retrieval
Document collections
  Scientific articles from the IEEE Computer Society, 1995 – 2002
  About 500 MB
  Each article consists of 1500 XML nodes on average
Queries
  Strict content and structure (SCAS)
  Vague content and structure (VCAS)
Gold standard
  Relevance assessments provided by INEX

Evaluation of Content Similarity

Dataset: INEX 03 test collection
Query set: 30 SCAS queries
Comparison: 38 submissions in INEX 03

[Figure: precision-recall curve (x-axis: recall, y-axis: precision); Avg. Precision 0.3309]

Evaluation of the Cost Model

Dataset: INEX 05 test collection
Query set: 22 simple VCAS queries
Evaluation metric: normalized extended cumulative gain (nxCG)
  The official evaluation metric used in INEX 05
  Given a number i (i ≥ 1), nxCG@i, similar to precision@i, measures the relative gain users have accumulated up to rank i
  E.g., nxCG@10, nxCG@25, nxCG@50, …
Cost models
  UCost: uniform cost for each relaxation operation (baseline)
  SCost: our proposed cost model

Retrieval performance improvements with semantic cost model

Query set: all content-and-structure queries in INEX 05

$$ \mathrm{sim}(A, Q) = \alpha^{\mathrm{struct\_dist}(A, Q)} \cdot \mathrm{cont\_sim}(A, Q) $$

nxCG@10 for each pair (α, cost model):

Cost model   α = 0.1            α = 0.3            α = 0.5            α = 0.7        α = 0.9
Uniform      0.2584             0.2616             0.2828             0.2894         0.2916
Semantic     0.3319 (+28.44%)   0.3190 (+21.94%)   0.3196 (+13.04%)   0.3068 (+6%)   0.2957 (+4.08%)

Assigning relaxation operations different costs based on the similarities of the nodes being operated on improves retrieval performance!
nxCG@25 and nxCG@50 yield similar results

Evaluation of the Cost Model

Result

$$ \mathrm{sim}(A, Q) = \alpha^{\mathrm{struct\_dist}(A, Q)} \cdot \mathrm{cont\_sim}(A, Q) $$

Each cell: nxCG@10 for a given pair (α, cost model) (% of improvement over the baseline)

Cost model   α = 0.1            α = 0.3            α = 0.5            α = 0.7        α = 0.9
UCost        0.2584             0.2616             0.2828             0.2894         0.2916
SCost        0.3319 (+28.44%)   0.3190 (+21.94%)   0.3196 (+13.04%)   0.3068 (+6%)   0.2957 (+4.08%)

Utilizing node similarities to distinguish costs of different operations improves retrieval performance!
Similar results are observed using nxCG@25 and nxCG@50

Expressiveness of the Relaxation Language

INEX 05 Topic 267

<inex_topic topic_id="267" query_type="CAS">
  <castitle>//article//fm//atl[about(., "digital libraries")]</castitle>
  <description>Articles containing "digital libraries" in their title.</description>
  <narrative>I'm interested in articles discussing Digital Libraries as their main subject. Therefore I require that the title of any relevant article mentions "digital library" explicitly. Documents that mention digital libraries only under the bibliography are not relevant, as well as documents that do not have the phrase "digital library" in their title.</narrative>
</inex_topic>

Expressing Topic 267 using RLXQuery

FOR $a in doc("inex.xml")//article
LET $b = $a//fm//!atl REJECT(fm, bb)
WHERE $b[about(., "digital libraries")]
RETURN $b

Effectiveness of the Relaxation Control

Expressing Topic 267 with RLXQuery

FOR $a in doc("inex.xml")//article
LET $b = $a//fm//!atl REJECT(fm, bb)
WHERE $b[about(., "digital libraries")]
RETURN $b

Results

Method                    nxCG@10                  nxCG@25
No relaxation control     0.1013                   0.2365
With relaxation control   1.0 (perfect accuracy)   0.8986

Relaxation control enables the system to provide answers with greater relevancy!

Evaluation of the Ranking Function

Dataset: INEX 05 test collection
Query set: 4 official VCAS queries with available relevance assessments
Comparison: top-1 submission in INEX 05

Results

Topic     nxCG@10 Top-1    nxCG@10 CoXML   nxCG@25 Top-1   nxCG@25 CoXML
256       0.4293           0.4248          0.4733          0.5555
264       0.0              0.0069          0.0             0.0033
275       0.7715           0.638           0.589           0.5922
284       0.0              0.1259          0.0             0.1233
Average   0.3002 (+0.4%)   0.2989          0.2656          0.3186 (+20%)

The systematic relaxation approach enables our system to derive more approximate answers!
Our ranking function, based on both content and structure relevancy, outperforms other ranking functions using content similarities only!


CoXML Testbed

Team Members: Prof. Chu, S. Liu, T. Lee, E. Sung, C. Cardenas, A. Putnam, J. Chen, R. Shahinian

[Figure: testbed architecture with the components RLXQuery Preprocessor, RLXQuery Parser, Relaxation Controller, Relaxation Manager, Database Manager, Ranking Module, Relaxation Index Builder (XTAH) and XML Database Engine (XML Documents); input: RLXQuery; output: Approximate Answers]


Relaxation Examples using the Testbed


Related Work: Query Relaxation

Relaxation based on schema conversions ([LC01, LMC01], [LMC03])
  No structure relaxation
Native XML relaxation
  Propose structure relaxation types [e.g., KS01, ACS02]
    We use the relaxation types introduced in [ACS02]
  Investigate efficient algorithms for deriving top-K answers based on the relaxation types supported [e.g., Sch02, ACS02, ALP04, AKM05]
  No relaxation control

Related Work: XML Ranking

Content ranking
  Most extend ranking models for text retrieval to the XML scenario, e.g., HyRex, XXL, JuruXML, XSearch
  We utilize structure to distinguish terms of different weights occurring in different parts of a document
Structure ranking
  Based on tree editing distance algorithms without considering operation cost [NJ02]
  Based on the occurrence frequency of the query trees, paths, or predicates in data [MAK05, AKM05]
  Our structure ranking is similar to editing distance, but we consider operation cost

Conclusion

Cooperative XML (CoXML) query answering

RLXQuery enables users to effectively express approximate query conditions and to control the approximate matching process

XTAH provides systematic query relaxation guidance

Both content and structure similarity metrics for evaluating the relevancy of approximate answers

Evaluation studies with the INEX test collections demonstrate the effectiveness of our methodology