xml keyword search refinement

LOGO

XML Keyword Search Refinement

郭青松

Outline

Introduction

Query Refinement in Traditional IR

XML Keyword Query Refinement

My work

Why we need query refinement?

User express their query intention by keywords, but their don’t know how to formulate good query Lack of experience Too many expression forms Unfamiliar with the system Have no idea about the data

Query Refinement� Refine the query and get good results

What is Query Refinement?

Query expansion(query reformulation)

Given an ill-formed query from the user, we refine the query and help the user to better retrieve documents.

The goal is to improve precision and/or recall.

Example: “cars” “car”, “automobile”, “auto”

XML Search

Tag + Keyword search book: xml

Path Expression + Keyword search (CAS Queries) /book[./title about “xml db”]

Structure query XPath, XQuery

Keyword search (CO Queries) “xml”

XML Keywords Search VS IR

IR Flat HTML pages Whole page returned

XML Model(tree、 graph) Structural(semi-structural) Semantic-based query(LCA, SLCA…) Information fragment returned

Need of XML Keyword Query Refinement

Hard to know the XML content Especially big xml document

Information fragments(LCA\SLCA) Easily affect the results(Precision ) Huge difference of query results

IR style refinement methods is not suitable for xml Only content be considered Need structure information to form a good

query

Outline

Introduction



My work

Tasks

Spelling CorrectionWord Splitting/Word MergingPhrase SegmentationWord StemmingAcronym ExpansionAdd/Delete Terms Substitution

Classes of Query Refinement

Relevance feedback Users mark documents(relevant, nonrelevant) Reweight the terms in the query

Automatic query Refinement System analysis the relevance of documents

and query, give refined query automatically Global analysis Local analysis

Relevance Feedback

Began in the 1960sImprovement in recall and precision

Basic process as follows1. The user issues their initial query

2. The system returns an initial result set.

3. The user then marks some returned documents as relevant or nonrelevant.

4. The system then re-weights the terms and refine the query results

Relevance Feedback Models

Boolean. Terms appear in document: relevance

Vector Space. q=(t1, t2,…, tn) d=(w1, w2,…, wn)

Probabilistic. Relevance of a query and documents

evaluate as probability Probabilistic ranking principle

n

i

n

iii

n

iii dqdqdqsim

1 1

22

1

)()(),(

Rocchio algorithm for vector-space model

qm :refined query vectorq0: the original query vector Dr : relevant documents , Dnr: nonrelevant

documents α, β, γ: weights attached to each term

Average relevant- document vector

Average non-relevant document vector

Global analysis(1)

Using all documents to compute the similarity of query q and terms in the documents

Similarity Thesaurus based

nnnn

n

n

n

n

www

www

www

t

t

tddd

DK

...

............

...

...

...

...

)(

21

22221

11211

2

1

21

jiij ttw document and termofweight :

vectortermti :

}{,,...,, 21 tkkkkq sm

Global analysis(2)

Select r terms with highest sim value and adding into initial query , reformulate the new query

jvd

juvuvu wwttcj

,,,

Similarity of terms

iqt

qi twqi

,Query vector

qk

jiqijiqt

qijj

ii

cwttwkqkqsim ,,,),(

Similarity of query and terms

Company name

Local analysis

Local analysis: Using initial query results(especially documents front ,local documents) to refine the query

Local clustering Clustering the term of local documents Query refined with the relevant cluster Similarity of terms in query and terms in documents

Local context analysis(LCA) Get the most similar term in local documents with the query q to

expanse Similarity of q and terms in documents

www.themegallery.com

Outline

Introduction



My work

Company name

XML Refinement Manner(1)

Query refined form Keywords query New Keywords Query

• Treat as traditional IR problem• IR with XML Keyword search Semantics

Keywords Structural QueryUser participant

Manually(User Interactive )• Structural Feedback

Automatic

www.themegallery.com

XML Refinement Manner (2)

Manually Refined to new Keywords Query IR(consider the structure of xml)

Manually Transform to Structural Query Relevance Feedback

Automatic Refined to new Keywords Query Lu jiaheng:

Automatic Transform to Structural Query NLP

Automatic Refined to new Keywords Query(1)

Query Refined Query Rule based

Operation Term merging: Term splitting: Term substitution: Term deletion

kkkk n ,...,, 21

nkkkk ,...,, 21''

2'121 ,...,,,...,, nn kkkkkk

Original query Refined query

IR,2003,Mike Information Retrieval,2003,Mike

Mike, publication Mike, publications

Database, paper Database, in-proceedings

XML, John,2003 XML, John

machin, learn machine, learning

Hobby, news, paper Hobby, newspaper

On, line, data, base Online, database

Automatic Refined to new Keywords Query(2)

Ranking Refined query candidates set S(RQ)

Refinement cost Cost: the step of “op” from “Q” to “RQ” Dynamic programming

Efficient Refinement Algorithms Avoid the multiple scan invert list stack-based ,stack-based, short-list-eager approach

RQ candidates have the same refinement cost Q={XML, Jim, 2001}{XML, 2001}, {Jim, 2001} or

{XML, Jim}

NLPX

Natural Language Query (NLQ) NEXINEXI(Narrowed Extended XPath I)

//A[about(//B,C)] A: path expression, B :relative path expression to A C is the content requirement. ‘about’ clause represents an individual

information request.

NLPX—Lexical and Semantic Tagging

structural words: content requirementsboundary words: Path expression

instruction words R :return request , S :support request.

Find sections about compression in articles about information retrieval

Tagged: Find/XIN sections/XST about/XBD compression/NN in/IN articles/XST about/XBD information/NN retrieval/NN

NLPX—Template Matching

most queries correspond to a small set of patterns

formulate grammar templates with patternsQuery: Request+ Request : CO_Request | CAS_Request CO_Request: NounPhrase+ CAS_Request: SupportRequest | ReturnRequest SupportRequest: Structure [Bound] NounPhrase+ ReturnRequest: Instruction Structure [Bound] NounPhrase+

Grammar Templates

Request 1 Request 2 Structural: /article/sec /articlec Content: compression information retrieval Instruction: R S

Information Requests

NLPX—NEXI Query Production

merge the information request into NEXI query.

A[about(.,C)] A :the request structural attribute and C : the request content attribute.

//article[about(.,information retrieval)]//sec[about (.,compression)]

Query generation process

Create target component Break up the query into units

Generate initial target combinations of input target components

Generate queries modifying a target component combing two components

Initialization

Breaks up the input query into terms Structure( XML tags or attributes) Content term(refer to text)

Create component Structure term unbound target Content term binding to a bound target

Probability enumeration

Target component and target sets

{//author[~’jennifer widom’]} 0.6842{//editor[~’jennifer widom’]} 0.3150 {//title[~’jennifer widom’]} 0.0004

{//article} 0.5000

{//inproceedings} 0.5000

Jennifer widom

papers

{//article} {//author[∼‘jennifer widom’]} 0.3421{//inproceedings} {//author[∼‘jennifer widom’]} 0.3421{//inproceedings} {//editor[∼‘jennifer widom’]} 0.1577{//article} {//editor[∼‘jennifer widom’]} 0.1577{//inproceedings} {//title[∼‘jennifer widom’]} 0.0002{//article} {//title[∼‘jennifer widom’]} 0.0002

Query: Papers by jennifer widom

Transformation Operators(1)

Aggregation: merge targets with same tag {//a}, {//a[~’x’]} {//a[~’x’]} {//a[~’x’]} , {//a[~’y’]} {//a[~’x y’]}

Prefix expansion: add an ancestor condition {//b} {//a//b} {//b[~’x’]} {//a//b[~’x’]}

Ordering: combine targets {//a}, {//b} {//a//b} or {//a[//b]} {//a}, {//b[~’x’]} {//a//b[~’x’]} or {//a[//b[~’x’]]}

Conclusion

Two stronger assumption Keyword query non-ambiguity Availability of XML thesaurus

Accuracy: terms classification didn’t consider specific

XML contextTime costly:

Term classification Targets create scan the XML documents

Outline

Introduction



My work

LOGOwww.themegallery.com

xml keyword search refinement

Documents