xclean: providing valid spelling suggestions for xml keyword queries

XClean: Providing Valid Spelling Suggestions for XML Keyword Queries. Yifei Lu (1), Wei Wang (1), Jianxin Li (2) and Chengfei Liu (2). (1) University of New South Wales, (2) Swinburne University of Technology.

Upload: kalyca

Post on 22-Feb-2016


TRANSCRIPT

Page 1: XClean: Providing Valid Spelling Suggestions for XML Keyword Queries

XClean: Providing Valid Spelling Suggestions for XML Keyword Queries

Yifei Lu (1), Wei Wang (1), Jianxin Li (2) and Chengfei Liu (2)

(1) University of New South Wales
(2) Swinburne University of Technology

Page 2: XClean: Providing Valid Spelling Suggestions for XML Keyword Queries

XML Keyword Search

User: I want to find a data mining paper coauthored by Jiawei Han

Query: jiawi minning

[Figure: example DBLP XML tree. The root DBLP has two paper children and one book child. Node labels shown: title (“Mining concept”, “link mining”) and author (“Eric”, “Jiawei Han”, “Jiawei Han”, “Manning”, “Jian Pei”).]

Page 3: XClean: Providing Valid Spelling Suggestions for XML Keyword Queries


Challenges

Must offer highly plausible suggestions

The suggested query should have non-empty results

Must be highly efficient

Page 4: XClean: Providing Valid Spelling Suggestions for XML Keyword Queries


Poor Suggestion

Page 5: XClean: Providing Valid Spelling Suggestions for XML Keyword Queries


Empty Result

Page 6: XClean: Providing Valid Spelling Suggestions for XML Keyword Queries

Empty Result

Query: jiawi minning

Pu and Yu [PVLDB08] will suggest “jian manning”. This is worse than “jiawei mining”: the two suggested keywords have no meaningful connection in the tree.

[Figure: the example DBLP XML tree, with the same node labels as in the XML Keyword Search example.]

Page 7: XClean: Providing Valid Spelling Suggestions for XML Keyword Queries

Problem Definition

Data T: a set of XML document trees, formed into a single tree by adding a virtual root node.

Query Q = {jiawi, minning}

Confusion Set: valid words in the vocabulary with edit distance ≤ threshold from a query keyword.

  jiawi   -> jiawei, jian
  minning -> mining, manning

Candidate Query Space S

Query Cleaning: find the top-k queries from the Candidate Query Space, ranked by

$$\Pr(C \mid Q, T), \quad C \in S$$
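The confusion sets and the candidate query space can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the vocabulary below is a hypothetical sample of words from the example tree, the threshold of 2 is chosen so that "jian" is reachable from "jiawi", and the candidate space is taken to be the Cartesian product of the confusion sets, which is consistent with the four candidates listed on the Algorithm page.

```python
from itertools import product


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with the standard dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,           # delete ca
                            curr[j - 1] + 1,       # insert cb
                            prev[j - 1] + cost))   # substitute or match
        prev = curr
    return prev[-1]


def confusion_set(keyword, vocabulary, threshold):
    """Valid vocabulary words within the edit-distance threshold of the keyword."""
    return sorted(w for w in vocabulary if edit_distance(keyword, w) <= threshold)


def candidate_space(query, vocabulary, threshold):
    """Assumed here: one word per keyword, drawn from that keyword's confusion set."""
    sets = [confusion_set(q, vocabulary, threshold) for q in query]
    return list(product(*sets))


# Hypothetical vocabulary sampled from the example tree.
vocabulary = {"jiawei", "jian", "han", "mining", "manning",
              "concept", "link", "eric", "pei"}
query = ["jiawi", "minning"]
space = candidate_space(query, vocabulary, threshold=2)
for candidate in space:
    print(" ".join(candidate))
```

With this vocabulary the space contains the four combinations of {jiawei, jian} and {mining, manning}.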

Page 8: XClean: Providing Valid Spelling Suggestions for XML Keyword Queries

Ranking Candidate Queries

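How to model Pr(C | Q, T)? By Bayes' Theorem:

$$\Pr(C \mid Q, T) = \frac{\Pr(Q \mid C, T)\,\Pr(C \mid T)}{\Pr(Q \mid T)}$$

The denominator Pr(Q | T) is the same for every candidate C, so rank by

$$\Pr(Q \mid C, T) \cdot \Pr(C \mid T)$$

Pr(Q | C, T): Error Model
Pr(C | T): Query Likelihood Model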

Page 9: XClean: Providing Valid Spelling Suggestions for XML Keyword Queries

Error Model

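The Error Model estimates Pr(Q | C, T).

Modeling Typographical Errors
  The more similar, the more likely
  Similarity measured by Edit Distance

$$\Pr(q \mid w) = \frac{1}{z} \cdot \exp(-\beta \cdot ed(q, w))$$

Independence Assumption

$$\Pr(Q \mid C, T) = \Pr(Q \mid C) = \prod_{1 \le j \le l} \Pr(Q[j] \mid C[j])$$

[Figure: Edit Distance. Words around the query keyword "minning", in rings labelled ed=1 and ed=2. By edit distance, mining and manning are at ed=1; linking, finding, running and binding are at ed=2.]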
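A minimal sketch of the error model follows. The values beta = 1 and z = 1 are assumptions made for illustration (the slides give neither), and the function names are hypothetical.

```python
import math


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with the standard dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def error_prob(q, w, beta=1.0, z=1.0):
    """Pr(q | w) = (1/z) * exp(-beta * ed(q, w)); beta and z are assumed values."""
    return math.exp(-beta * edit_distance(q, w)) / z


def query_error_prob(Q, C, beta=1.0, z=1.0):
    """Pr(Q | C) under the independence assumption: a product over keyword positions."""
    if len(Q) != len(C):
        raise ValueError("query and candidate must have the same number of keywords")
    return math.prod(error_prob(q, c, beta, z) for q, c in zip(Q, C))


Q = ["jiawi", "minning"]
p_jiawei_mining = query_error_prob(Q, ["jiawei", "mining"])   # edit distances 1 and 1
p_jian_manning = query_error_prob(Q, ["jian", "manning"])     # edit distances 2 and 1
print(p_jiawei_mining > p_jian_manning)
```

Because the probability decays exponentially with edit distance, "jiawei mining" gets a higher error-model score than "jian manning" for this query.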

Page 10: XClean: Providing Valid Spelling Suggestions for XML Keyword Queries

Query Likelihood Model

The Query Likelihood Model estimates Pr(C | T).

Modeling Query Generation Probability
  A good query finds good results
  T is a set of disjoint entities (sub-trees)
  Measure the query likelihood on each entity
  Aggregate through all entities

$$\Pr(C \mid T) = \sum_{r \in \text{entities}} \Pr(C \mid r) \cdot \Pr(r \mid T)$$

Pr(r | T): Entity Prior (assume uniform)

[Figure: the example DBLP tree T divided into three entities r1, r2 and r3.]
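The aggregation step can be illustrated as below, assuming the uniform entity prior mentioned on the slide. The per-entity likelihood values in the example are made-up numbers, not results from the talk.

```python
def query_likelihood(per_entity_likelihoods):
    """Pr(C | T) = sum over entities r of Pr(C | r) * Pr(r | T), with a uniform prior."""
    n = len(per_entity_likelihoods)
    if n == 0:
        return 0.0
    prior = 1.0 / n  # uniform entity prior Pr(r | T)
    return sum(p * prior for p in per_entity_likelihoods)


# Hypothetical values of Pr(C | r) for three entities r1, r2, r3.
likelihood = query_likelihood([0.02, 0.0, 0.0])
print(likelihood)
```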

Page 11: XClean: Providing Valid Spelling Suggestions for XML Keyword Queries

Language Modeling

Modeling query likelihood Pr(C | r) on entities
  Extract text in the sub-tree
  Build a Language Model

Word       Freq  Pr(w)
mining     2     0.2
data       2     0.2
jiawei     1     0.1
concept    1     0.1
drifting   1     0.1
han        1     0.1
knowledge  1     0.1
discovery  1     0.1

Pr(C | r1) = Pr(jiawei | r1) ⋅ Pr(mining | r1) = 0.1 × 0.2 = 0.02

Smoothing is used to avoid zero probability

[Figure: entity r1, a paper sub-tree under DBLP with title “Mining concept drifting data ……”, author “Jiawei Han” and booktitle “Data mining and knowledge discovery”.]
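The language model for r1 can be reproduced with a short sketch. Two things here are assumptions: "and" is dropped as a stopword so that the counts match the word table, and the smoothing shown is simple additive smoothing, since the slides do not say which smoothing method is used. Only the words visible on the slide are included; the elided part of the title is left out.

```python
from collections import Counter

STOPWORDS = {"and"}  # assumption: needed for the counts to match the word table


def build_language_model(text):
    """Return (word counts, total word count) for the text of one entity."""
    tokens = [t for t in text.lower().split() if t not in STOPWORDS]
    counts = Counter(tokens)
    return counts, sum(counts.values())


def word_prob(word, counts, total, alpha=0.0, vocab_size=1):
    """Additive smoothing (an assumed method); alpha=0 is the plain frequency estimate."""
    return (counts[word] + alpha) / (total + alpha * vocab_size)


def candidate_likelihood(candidate, counts, total, alpha=0.0, vocab_size=1):
    """Pr(C | r) as a product of per-word probabilities."""
    p = 1.0
    for w in candidate:
        p *= word_prob(w, counts, total, alpha, vocab_size)
    return p


# Text of entity r1: title words, author, booktitle.
r1_text = "Mining concept drifting data Jiawei Han Data mining and knowledge discovery"
counts, total = build_language_model(r1_text)

# Pr(jiawei | r1) * Pr(mining | r1) = 0.1 * 0.2
unsmoothed = candidate_likelihood(["jiawei", "mining"], counts, total)
# "jian" does not occur in r1; smoothing keeps the product above zero.
smoothed = candidate_likelihood(["jian", "mining"], counts, total, alpha=0.1, vocab_size=20)
print(unsmoothed, smoothed)
```

The `vocab_size=20` argument is an arbitrary illustrative value.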

Page 12: XClean: Providing Valid Spelling Suggestions for XML Keyword Queries

Finding the entities

How to find the entities
  Each entity is a potential search result
  Different semantics can be applied
    SLCA, ELCA, etc.
    Specific Return Type
      One for each query
      Popular type
      But not too deep

Example return type: p = /DBLP/paper

[Figure: the example DBLP XML tree, illustrating the return type p = /DBLP/paper.]

Page 13: XClean: Providing Valid Spelling Suggestions for XML Keyword Queries

Summary: Ranking Framework

$$\Pr(Q \mid C, T) \cdot \sum_{r} \Pr(C \mid r)\,\Pr(r \mid T)$$

Pr(Q | C, T): Error Model
Pr(C | r): Query likelihood on each entity
Pr(r | T): Entity Prior

Page 14: XClean: Providing Valid Spelling Suggestions for XML Keyword Queries

Algorithm

Naïve Algorithm
  Enumerate all possible candidate queries
  Find the entities and compute the score for each candidate query

Problems:
  Multiple passes of data
  Not all candidates are needed

Candidate queries:
  1. Jiawei mining
  2. Jian mining
  3. Jiawei Manning
  4. Jian Manning

[Figure: the example DBLP XML tree with the variants Jiawei, Jian and Manning marked under author nodes.]
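The naïve algorithm can be sketched as follows. This is an illustration, not the authors' code: the three entities and their words are taken from the Dewey-labelled tree in the XClean Example pages, the language model is unsmoothed, and beta = 1, z = 1 and the threshold of 2 are assumed values.

```python
import math
from collections import Counter
from itertools import product


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with the standard dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def naive_clean(query, entities, threshold=2, beta=1.0, k=2):
    """Enumerate every candidate query and score each with a full pass over the entities."""
    vocabulary = sorted({w for tokens in entities.values() for w in tokens})
    confusion = [[w for w in vocabulary if edit_distance(q, w) <= threshold]
                 for q in query]
    prior = 1.0 / len(entities)  # uniform entity prior
    scored = []
    for cand in product(*confusion):
        # Error model with z = 1: product of exp(-beta * ed) over keyword positions.
        error = math.exp(-beta * sum(edit_distance(q, c) for q, c in zip(query, cand)))
        likelihood = 0.0
        for tokens in entities.values():  # one pass over all the data per candidate
            counts = Counter(tokens)
            p = 1.0
            for w in cand:
                p *= counts[w] / len(tokens)  # unsmoothed Pr(w | r)
            likelihood += p * prior
        scored.append((error * likelihood, cand))
    scored.sort(reverse=True)
    return scored[:k]


# Words per entity, read off the Dewey labels in the XClean Example pages.
entities = {
    "1.1": ["jiawei", "jian"],
    "1.2": ["mining", "jiawei"],
    "1.3": ["jian", "manning"],
}
top = naive_clean(["jiawi", "minning"], entities)
for score, cand in top:
    print(" ".join(cand), round(score, 4))
```

All four candidates are scored even though two of them ("jian mining" and "jiawei manning") never co-occur in any entity and end up with a score of zero, which is the waste the slide points out.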

Page 15: XClean: Providing Valid Spelling Suggestions for XML Keyword Queries

XClean Example

Query: jiawi minning

Inverted lists (Dewey labels of the nodes containing each variant):

  jiawei:  1.1.1.1.1, 1.2.2.1
  jian:    1.1.1.2.1, 1.3.1.1.1
  mining:  1.2.1.1
  manning: 1.3.2.1.1

[Figure: the example DBLP tree with Dewey labels. The root DBLP is node 1; its children paper, paper and book are nodes 1.1, 1.2 and 1.3; labels run down to depth 5 (for example 1.1.1.1.1). Pointers p1 to p4 mark the current position in each of the four inverted lists.]

Page 16: XClean: Providing Valid Spelling Suggestions for XML Keyword Queries

XClean Example

Query: jiawi minning

[Figure: the Dewey-labelled DBLP tree and the inverted lists for jiawei, jian, mining and manning, with list pointers p1 to p4.]

“jiawei mining” is generated: entity 1.2 contains both jiawei (1.2.2.1) and mining (1.2.1.1).

“jian mining” is skipped: no entity contains both jian and mining.

Page 17: XClean: Providing Valid Spelling Suggestions for XML Keyword Queries

XClean Example

Query: jiawi minning

[Figure: the Dewey-labelled DBLP tree and the inverted lists for jiawei, jian, mining and manning, with list pointers p1 to p4; p2 and p4 are marked.]

“jian manning” is generated: entity 1.3 contains both jian (1.3.1.1.1) and manning (1.3.2.1.1).

Page 18: XClean: Providing Valid Spelling Suggestions for XML Keyword Queries

Experiment Settings

Algorithms
  XClean
  PY08: Pu and Yu [PVLDB08]
  SE1: Search Engine 1
  SE2: Search Engine 2

Measures
  Mean Reciprocal Rank
  Precision@N
  Time

Page 19: XClean: Providing Valid Spelling Suggestions for XML Keyword Queries

Experiment Settings

Datasets

Dataset  Size (MB)  #Nodes  Max depth  Avg depth  Queries
INEX     5,878      52M     50         5.58       285
DBLP     526        12M     7          3.8        49

Queries
  Clean: original clean queries (INEX: 285, DBLP: 49)
  Random: random edit operations on each keyword
  Rule: replace each word with a common misspelling

Page 20: XClean: Providing Valid Spelling Suggestions for XML Keyword Queries

Experiment Results

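Mean Reciprocal Rank (MRR)

$$\mathrm{MRR} = \frac{1}{N} \sum_{1 \le i \le N} \frac{1}{\mathrm{rank}(Q_i)}$$

Here N is the number of queries and rank(Q_i) is the rank of the correct suggestion for the i-th query.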
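A small sketch of the MRR computation, for reference. The suggestion lists below are invented placeholders, and treating a query whose correct suggestion is missing as contributing 0 is a common convention assumed here, not something the slides state.

```python
def mean_reciprocal_rank(suggestion_lists, correct_queries):
    """Average of 1 / rank of the correct suggestion over all queries."""
    total = 0.0
    for suggestions, correct in zip(suggestion_lists, correct_queries):
        if correct in suggestions:
            total += 1.0 / (suggestions.index(correct) + 1)  # ranks start at 1
        # Assumption: a missing correct suggestion contributes 0.
    return total / len(correct_queries)


# Hypothetical ranked suggestions for three queries.
suggestion_lists = [["a", "b"], ["x", "y", "z"], ["p"]]
correct_queries = ["a", "z", "q"]
mrr = mean_reciprocal_rank(suggestion_lists, correct_queries)
print(mrr)
```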

Page 21: XClean: Providing Valid Spelling Suggestions for XML Keyword Queries

Experiment Results

Precision@N: the percentage of queries for which the correct suggestion is in the top-N suggestions
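Precision@N as defined here can be computed as in the sketch below. The ranked suggestion lists are invented placeholders, not data from the experiments.

```python
def precision_at_n(suggestion_lists, correct_queries, n):
    """Fraction of queries whose correct suggestion appears in the top-n suggestions."""
    hits = sum(1 for suggestions, correct in zip(suggestion_lists, correct_queries)
               if correct in suggestions[:n])
    return hits / len(correct_queries)


# Hypothetical ranked suggestions for three queries.
ranked = [["a", "b"], ["x", "y", "z"], ["p"]]
truth = ["a", "z", "q"]
p_at_1 = precision_at_n(ranked, truth, 1)
p_at_3 = precision_at_n(ranked, truth, 3)
print(p_at_1, p_at_3)
```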

Page 22: XClean: Providing Valid Spelling Suggestions for XML Keyword Queries

Experiment Results

Time: query processing time

Page 23: XClean: Providing Valid Spelling Suggestions for XML Keyword Queries

Conclusion

Contributions
  A probabilistic framework for keyword query cleaning on XML databases
  An Error Model based on edit distance
  A Query Likelihood Model that exploits XML tree structures and keyword search semantics

Future work
  Concatenation/splitting of words
  Cognitive errors

Page 24: XClean: Providing Valid Spelling Suggestions for XML Keyword Queries

Thank you! Questions?

Page 25: XClean: Providing Valid Spelling Suggestions for XML Keyword Queries

XClean Algorithm

1) Find the variants w_ij for each query keyword q_i, and compute the error probability Pr(q_i | w_ij)

2) Retrieve the XML nodes containing each variant through an inverted index

3) The nodes of all variants of q_i form a virtual list

4) Find the entity nodes that have at least one child node from each virtual list
   a) Compute Pr(C | r) for each candidate query C found in each entity r
   b) Accumulate the scores in a global hash table

5) Output the top-k candidate queries
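The steps above can be sketched in Python using the Dewey labels from the XClean Example pages. This is a simplified illustration under several assumptions: entities are taken to be the nodes at depth 2, "child node" is read as any descendant, the language model is unsmoothed, beta = 1 and z = 1, and postings are grouped by entity with a hash map rather than by the pointer-based merge of sorted lists that the example slides show. The function name and the `entity_sizes` input are hypothetical.

```python
import heapq
import math
from collections import defaultdict
from itertools import product


def xclean_sketch(variants, inverted_index, entity_sizes, entity_depth=2, beta=1.0, k=2):
    """variants: one dict per query keyword, mapping variant -> edit distance (step 1).
    inverted_index: variant -> list of Dewey labels (tuples) of nodes containing it.
    entity_sizes: entity Dewey prefix -> number of words in that entity (assumed input)."""
    # Steps 2-3: fetch postings and merge all variants of keyword i into virtual list i,
    # grouped by the entity (Dewey prefix) each node belongs to.
    per_entity = defaultdict(lambda: [defaultdict(int) for _ in variants])
    for i, keyword_variants in enumerate(variants):
        for w in keyword_variants:
            for label in inverted_index.get(w, []):
                per_entity[label[:entity_depth]][i][w] += 1

    prior = 1.0 / len(entity_sizes)  # uniform entity prior
    scores = defaultdict(float)      # step 4b: global hash table
    for entity, found in per_entity.items():
        # Step 4: keep only entities with a node from every virtual list.
        if any(not f for f in found):
            continue
        size = entity_sizes[entity]
        # Step 4a: only candidates that actually occur in this entity are enumerated.
        for cand in product(*[sorted(f) for f in found]):
            likelihood = math.prod(found[i][w] / size for i, w in enumerate(cand))
            error = math.exp(-beta * sum(variants[i][w] for i, w in enumerate(cand)))
            scores[cand] += error * likelihood * prior
    # Step 5: top-k candidates by accumulated score.
    return heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])


# Step 1 output for the query "jiawi minning": variants with their edit distances.
variants = [{"jiawei": 1, "jian": 2}, {"mining": 1, "manning": 1}]
inverted_index = {
    "jiawei": [(1, 1, 1, 1, 1), (1, 2, 2, 1)],
    "jian": [(1, 1, 1, 2, 1), (1, 3, 1, 1, 1)],
    "mining": [(1, 2, 1, 1)],
    "manning": [(1, 3, 2, 1, 1)],
}
entity_sizes = {(1, 1): 2, (1, 2): 2, (1, 3): 2}  # words per entity in the toy tree
top_k = xclean_sketch(variants, inverted_index, entity_sizes)
for cand, score in top_k:
    print(" ".join(cand), round(score, 4))
```

Entity 1.1 holds only variants of "jiawi", so it is skipped; "jiawei mining" comes from entity 1.2 and "jian manning" from entity 1.3, while "jian mining" is never enumerated, matching the example slides.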