
Reasoning and Identifying Relevant Matches for XML Keyword Search

Ziyang Liu, Yi Chen

Arizona State University

VLDB 2008, Auckland, New Zealand

Motivation

[Figure: sample XML tree with several team nodes, one shown in detail: name “Grizzlies”, founded 1995, division “southwest”, arena “FedExForum”, a league node (name “NBA”, founded 1946), and a players node with three player children: Gasol (nationality Spain, position forward), Miller (nationality USA, position guard), and Brown (nationality USA, position forward).]

Identifying relevant matches is a critical step of processing XML search.

Query: “Gasol, position”

relevant matches

irrelevant matches

How to Evaluate Various Strategies?

Existing approaches for identifying relevant matches:

XKSearch (SLCA) [Xu and Papakonstantinou 2005]

XRank [Guo et al. 2003]

XSEarch [Cohen et al. 2003]: star-semantics, all-semantics

Schema-free XQuery (MLCA) [Li et al. 2004]

CVLCA [Li et al. 2007]

The traditional approach

Obtain ground truth of query results by user studies on a large number of documents and queries.

Measure the precision and recall of a strategy with respect to the ground truth.

Costly.

An axiomatic approach

Formalize broad intuitions as a collection of simple axioms and evaluate strategies based on the axioms.

It has been successful in many areas, e.g. mathematical economics, clustering, location theory, and collaborative filtering.

Cost-effective.

Problem: Is it possible to evaluate and reason about XML keyword search strategies in a formal axiomatic framework?

Roadmap

Motivation and Problem Definition

Challenges and Contributions

Four properties that an XML search engine should satisfy: Query Monotonicity/Consistency, Data Monotonicity/Consistency

MaxMatch: the first system that satisfies all four properties

Experimental Evaluation

Conclusions

Challenge

[Figure: the team XML tree shown earlier.]

It is easy for an individual to assess the relevance of matches.

But it is extremely difficult to formalize the relevance assessment independently of any query, data, algorithm, and user.

Query: “Gasol, position”

relevant matches

irrelevant matches

Example: Similar Queries

Interestingly, we discovered that some abnormal behaviors can be clearly observed when examining the results that the same search engine produces for two similar queries, or for one query on two similar documents.

[Figure: the team XML tree shown earlier.]

Q1: “Gasol, position”

Q2: “Grizzlies, Gasol, position”

The two “position” nodes that are irrelevant for Q1 (those of the players other than Gasol) should still be irrelevant for Q2.

Example: Similar Data

[Figure: the team XML tree shown earlier, except that player Brown initially has no position node. A position node with value “forward” is then inserted under Brown.]

Q: “Grizzlies, Gasol, Brown, position”

An empty result after data insertion is abnormal.

How to capture the logical connection between query results?

Contributions of This Work

This is the first work that formally reasons about keyword search in an axiomatic framework.

We identified four desirable properties that an XML search engine should satisfy.

Data/Query Monotonicity captures the desirable changes to the number of query results.

Data/Query Consistency captures the desirable changes to the content of a query result.

We reasoned about existing XML keyword search strategies.

We proposed MaxMatch, the only XML keyword search strategy that possesses all four properties.

Experiments verified our intuition and demonstrated the effectiveness and efficiency of MaxMatch.


Properties with Respect to Similar Queries

Query Monotonicity: When we add a keyword to the query, the query becomes more restrictive, therefore the number of query results should not increase.

Query Consistency: When we add a new keyword to the query, each delta subtree that newly becomes (part of) a query result should contain the new keyword.
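The two query properties can be checked mechanically once results are modeled as sets of nodes. The following is a minimal sketch, not the authors' implementation. It assumes nodes are Dewey-style paths, each result is the set of nodes in its result tree, and a "delta subtree" is a maximal subtree of a new result that shares no node with any old result. The names `query_monotonic`, `delta_subtrees`, `query_consistent`, and the sample node ids are hypothetical.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

Node = Tuple[int, ...]     # Dewey-style path, e.g. (0, 1, 2); the root is (0,)
Result = FrozenSet[Node]   # one query result = the nodes of its result tree


def query_monotonic(before: List[Result], after: List[Result]) -> bool:
    """Adding a keyword must not increase the number of results."""
    return len(after) <= len(before)


def delta_subtrees(before: List[Result], new_result: Result) -> List[Set[Node]]:
    """Maximal subtrees of new_result that share no node with any old result
    (an assumed reading of 'delta subtree')."""
    old_nodes: Set[Node] = set().union(*before)

    def subtree(root: Node) -> Set[Node]:
        return {n for n in new_result if n[:len(root)] == root}

    def all_new(root: Node) -> bool:
        return subtree(root).isdisjoint(old_nodes)

    # A delta root is entirely new, and its parent (if present) is not.
    roots = [n for n in new_result
             if all_new(n) and (n[:-1] not in new_result or not all_new(n[:-1]))]
    return [subtree(r) for r in roots]


def query_consistent(before: List[Result], after: List[Result],
                     new_keyword: str, text: Dict[Node, str]) -> bool:
    """Every delta subtree must contain the newly added keyword."""
    kw = new_keyword.lower()
    return all(
        any(kw in text.get(n, "").lower().split() for n in delta)
        for result in after
        for delta in delta_subtrees(before, result)
    )


# Hypothetical encoding of part of the team tree (label and value per node).
text = {
    (0,): "team", (0, 0): "name Grizzlies", (0, 1): "players",
    (0, 1, 0): "player", (0, 1, 0, 0): "name Gasol",
    (0, 1, 0, 1): "position forward",
    (0, 1, 1): "player", (0, 1, 1, 0): "position guard",
}
q1 = [frozenset({(0, 1, 0), (0, 1, 0, 0), (0, 1, 0, 1)})]   # "Gasol, position"
good_q2 = [q1[0] | {(0,), (0, 0), (0, 1)}]                  # adds "Grizzlies"
bad_q2 = [good_q2[0] | {(0, 1, 1), (0, 1, 1, 0)}]           # also adds Miller

print(query_monotonic(q1, good_q2))
print(query_consistent(q1, good_q2, "Grizzlies", text))
print(query_consistent(q1, bad_q2, "Grizzlies", text))
```

In this toy encoding, the result that also pulls in Miller's subtree fails the consistency check, because that newly added subtree does not contain the new keyword.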

Example: Query Monotonicity/Consistency

[Figure: the team XML tree shown earlier.]

Q1: “forward, name”

Q2: “forward, USA, name” (“USA” is the new keyword)

Monotonicity: the number of query results decreases from 2 to 1.

Consistency: in each result, the delta subtree (if one exists) contains “USA”.

Example Revisited: Violation of Query Consistency

[Figure: the team XML tree shown earlier.]

Q1: “Gasol, position”

Q2: “Grizzlies, Gasol, position”

An XML keyword search engine that considers the “position” nodes of the players other than Gasol as relevant for the new query Q2 violates query consistency.

Properties with Respect to Similar Data

Data Monotonicity: When we add a node to the data, the data content becomes richer, and the number of query results should not decrease.

Data Consistency: After we add a node to the data, each delta subtree that becomes (part of) a query result should contain the newly inserted node.
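The data-side properties can be sketched in the same style. This is an illustration under stated assumptions, not the paper's algorithm: nodes are Dewey-style paths, a result is the set of nodes in its result tree, and a "delta subtree" is taken to be a maximal subtree of a new result that contains no node of any old result. The function names and node ids are hypothetical.

```python
from typing import FrozenSet, List, Set, Tuple

Node = Tuple[int, ...]     # Dewey-style path from the document root
Result = FrozenSet[Node]   # nodes of one result tree


def data_monotonic(before: List[Result], after: List[Result]) -> bool:
    """Inserting a node must not decrease the number of results."""
    return len(after) >= len(before)


def delta_subtrees(old_nodes: Set[Node], result: Result) -> List[Set[Node]]:
    """Maximal subtrees of result made only of nodes absent from old results."""
    def subtree(root: Node) -> Set[Node]:
        return {n for n in result if n[:len(root)] == root}

    def all_new(root: Node) -> bool:
        return subtree(root).isdisjoint(old_nodes)

    return [subtree(n) for n in result
            if all_new(n) and (n[:-1] not in result or not all_new(n[:-1]))]


def data_consistent(before: List[Result], after: List[Result],
                    inserted: Node) -> bool:
    """Every delta subtree must contain the newly inserted node."""
    old_nodes: Set[Node] = set().union(*before)
    return all(inserted in delta
               for result in after
               for delta in delta_subtrees(old_nodes, result))


# Q: "forward, name". Hypothetical ids: (0,1,0) is Gasol's player node and
# (0,1,2) is Brown's; (0,1,2,1) is the inserted position node under Brown.
gasol = frozenset({(0, 1, 0), (0, 1, 0, 0), (0, 1, 0, 1)})
brown = frozenset({(0, 1, 2), (0, 1, 2, 0), (0, 1, 2, 1)})
inserted = (0, 1, 2, 1)

before = [gasol]
after = [gasol, brown]
# A misbehaving engine that also grows Gasol's result with an unrelated node.
bad_after = [gasol | {(0, 1, 0, 2)}, brown]

print(data_monotonic(before, after))
print(data_consistent(before, after, inserted))
print(data_consistent(before, bad_after, inserted))
```

Here the result count grows from 1 to 2 and the only delta subtree is Brown's new result, which contains the inserted node. The misbehaving variant fails because it adds a subtree unrelated to the insertion.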

Example: Data Monotonicity/Consistency

[Figure: the team XML tree shown earlier, except that player Brown initially has no position node. A new match, a position node with value “forward”, is then inserted under Brown.]

Q: “forward, name”

Monotonicity: the number of query results increases from 1 to 2.

Consistency: in each result, the delta subtree (if one exists) contains the new data node.

Example Revisited: Violation of Data Monotonicity

[Figure: the team XML tree shown earlier, except that player Brown initially has no position node. A position node with value “forward” is then inserted under Brown.]

Q: “Grizzlies, Gasol, Brown, position”

An XML keyword search engine that outputs an empty result on the updated data violates data monotonicity.

The Proposed Axiomatic Framework

Four desirable properties: Query Monotonicity, Query Consistency, Data Monotonicity, Data Consistency

These properties are:

Non-trivial: no prior XML keyword search system satisfies all of them.

Non-redundant: an algorithm may violate any one of them while satisfying the others.

Satisfiable: we propose a novel technique, MaxMatch, that satisfies all four properties.


MaxMatch

MaxMatch’s name comes from “Maximal Match”.

MaxMatch preserves each subtree whose set of descendant keyword matches is “maximal” among its siblings.

Intuitively, the subtrees that are removed are strictly less relevant to the query, since they contain fewer keywords.
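The pruning rule can be illustrated with a short sketch. It is not the algorithm from the paper: it assumes "maximal among its siblings" means that a child is dropped when its set of matched keywords is empty or is a strict subset of a sibling's set, and that a keyword matches a node when it equals the node's label or text value. The `Node` class and the functions `dmatch` and `prune` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, List


@dataclass
class Node:
    label: str
    text: str = ""
    children: List["Node"] = field(default_factory=list)


def dmatch(node: Node, keywords: FrozenSet[str]) -> FrozenSet[str]:
    """Keywords matched by the node itself or by any of its descendants."""
    found = {k for k in keywords if k in (node.label.lower(), node.text.lower())}
    for child in node.children:
        found |= dmatch(child, keywords)
    return frozenset(found)


def prune(node: Node, keywords: FrozenSet[str]) -> Node:
    """Keep only children whose match set is non-empty and not strictly
    contained in a sibling's match set, then recurse.
    (dmatch is recomputed at each level for brevity, not efficiency.)"""
    scored = [(c, dmatch(c, keywords)) for c in node.children]
    kept = [c for c, m in scored
            if m and not any(m < other for _, other in scored)]
    return Node(node.label, node.text, [prune(c, keywords) for c in kept])


def player(name: str, nationality: str, position: str) -> Node:
    return Node("player", children=[Node("name", name),
                                    Node("nationality", nationality),
                                    Node("position", position)])


team = Node("team", children=[
    Node("name", "Grizzlies"),
    Node("founded", "1995"),
    Node("players", children=[player("Gasol", "Spain", "forward"),
                              player("Miller", "USA", "guard"),
                              player("Brown", "USA", "forward")]),
])

query = frozenset({"grizzlies", "gasol", "brown", "position"})
pruned = prune(team, query)
kept_players = [p.children[0].text for p in pruned.children[1].children]
print(kept_players)
```

Under these assumptions, Miller's subtree matches only “position”, which is strictly contained in the match sets of the Gasol and Brown subtrees, so it is removed while the other two players are kept.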

Q: “Grizzlies, Gasol, Brown, position”

[Figure: the team XML tree shown earlier, with one subtree marked “Not as informative as its siblings: discarded”.]

MaxMatch satisfies all four properties.

Proof details and algorithms can be found in the paper.


Search Quality

Data set: Baseball, Mondial

Query set: 36 queries in total

Ground truth: obtained by user study.

User perception of search results on query pairs and document pairs confirms our intuition of the proposed properties.

[Figure: F-measure of MaxMatch vs. existing approaches]
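For reference, the quality metric can be computed as follows. This is a minimal sketch that assumes the balanced F-measure (F1, the harmonic mean of precision and recall); the slides do not state which weighting was used. The function name and the sample match identifiers are hypothetical.

```python
from typing import Set, Tuple


def precision_recall_f(returned: Set[str],
                       relevant: Set[str]) -> Tuple[float, float, float]:
    """Precision, recall, and balanced F-measure of a returned match set
    against a ground-truth set of relevant matches."""
    hits = len(returned & relevant)
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Hypothetical example: the engine returns two position nodes, one relevant.
returned = {"gasol/position", "miller/position"}
relevant = {"gasol/position"}
p, r, f = precision_recall_f(returned, relevant)
print(p, r, f)
```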

Processing Time

[Figures: processing time on the Mondial data (515KB) and the Baseball data (1014KB)]

Conclusions

This is the first work on reasoning about and evaluating XML keyword search strategies using a formal axiomatic framework.

Four intuitive and elegant properties are proposed: query monotonicity/consistency, data monotonicity/consistency.

We designed and developed MaxMatch, the only XML keyword search strategy that satisfies all four properties.

Experiments verified the intuition of the properties and the effectiveness and efficiency of MaxMatch.

MaxMatch is incorporated as part of XSeek [Liu & Chen, SIGMOD 2007].

Thank You!

Questions?

Welcome to try MaxMatch at: xseek.asu.edu