xml keyword search refinement
XML Keyword Search Refinement
郭青松

Outline
Introduction
Query Refinement in Traditional IR
XML Keyword Query Refinement
My work

Why we need query refinement?
Users express their query intent with keywords, but they often do not know how to formulate a good query:
• Lack of experience
• Too many ways to express the same need
• Unfamiliarity with the system
• No knowledge of the underlying data
Query refinement: refine the query so that it returns good results

What is Query Refinement?
Also called query expansion or query reformulation
Given an ill-formed query from the user, we refine the query and help the user to better retrieve documents.
The goal is to improve precision and/or recall.
Example: "cars" → "car", "automobile", "auto"

XML Search
• Tag + keyword search: book: xml
• Path expression + keyword search (CAS, content-and-structure queries): /book[./title about "xml db"]
• Structured queries: XPath, XQuery
• Keyword search (CO, content-only queries): "xml"

XML Keyword Search vs. IR
IR
• Flat HTML pages
• The whole page is returned
XML
• Data model: tree or graph
• Structured (semi-structured) data
• Semantics-based queries (LCA, SLCA, …)
• Information fragments are returned
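The LCA/SLCA semantics mentioned above can be sketched in a few lines. This is a brute-force illustration only, assuming nodes are identified by Dewey labels (tuples of child positions) and that the keyword match lists are already known; the function names `lca` and `slca` and the sample labels are hypothetical.

```python
from itertools import product

def lca(a, b):
    """Lowest common ancestor of two Dewey labels = their longest common prefix."""
    prefix = []
    for x, y in zip(a, b):
        if x != y:
            break
        prefix.append(x)
    return tuple(prefix)

def slca(match_lists):
    """Smallest LCAs: LCAs of all match combinations that have no descendant LCA."""
    candidates = set()
    for combo in product(*match_lists):
        node = combo[0]
        for other in combo[1:]:
            node = lca(node, other)
        candidates.add(node)
    # Drop every candidate that is a proper ancestor of another candidate.
    return sorted(
        c for c in candidates
        if not any(o != c and o[:len(c)] == c for o in candidates)
    )

# Assumed toy document: root (0,), two books (0,0) and (0,1).
xml_matches = [(0, 0, 0), (0, 1, 0)]   # nodes containing "xml"
john_matches = [(0, 0, 1)]             # nodes containing "john"
result = slca([xml_matches, john_matches])
```

Here only the first book subtree is returned as a fragment, while the root (which also covers both keywords) is discarded because a smaller fragment exists.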

Need of XML Keyword Query Refinement
• The XML content is hard to know in advance, especially for large XML documents
• Results are information fragments (LCA/SLCA)
  • The query keywords easily affect the results (precision)
  • Query results can differ hugely
• IR-style refinement methods are not suitable for XML
  • They consider only content
  • Structural information is needed to form a good query

Query Refinement in Traditional IR

Tasks
• Spelling correction
• Word splitting / word merging
• Phrase segmentation
• Word stemming
• Acronym expansion
• Adding/deleting terms
• Substitution

Classes of Query Refinement
Relevance feedback
• Users mark documents as relevant or nonrelevant
• The terms in the query are reweighted
Automatic query refinement
• The system analyzes the relevance between the documents and the query and produces the refined query automatically
• Global analysis
• Local analysis

Relevance Feedback
• Began in the 1960s
• Improves recall and precision
Basic process:
1. The user issues an initial query.
2. The system returns an initial result set.
3. The user marks some returned documents as relevant or nonrelevant.
4. The system re-weights the terms and refines the query results.

Relevance Feedback Models
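• Boolean: relevance is decided by whether the query terms appear in the document
• Vector space: q = (t1, t2, …, tn), d = (w1, w2, …, wn)
• Probabilistic: the relevance of a document to a query is evaluated as a probability (probabilistic ranking principle)

Cosine similarity for the vector-space model, where q_i and d_i denote the i-th components of q and d:

$$\mathrm{sim}(q, d) = \frac{\sum_{i=1}^{n} q_i d_i}{\sqrt{\sum_{i=1}^{n} (q_i)^2}\,\sqrt{\sum_{i=1}^{n} (d_i)^2}}$$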
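The vector-space model compares a query and a document by the cosine of the angle between their weight vectors. A minimal sketch, assuming both vectors are plain lists of term weights over the same vocabulary (the function name and the zero-vector convention are illustrative choices, not part of the slides):

```python
import math

def cosine_similarity(q, d):
    """Cosine of the angle between two equal-length weight vectors."""
    if len(q) != len(d):
        raise ValueError("vectors must have the same length")
    dot = sum(qi * di for qi, di in zip(q, d))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    norm_d = math.sqrt(sum(di * di for di in d))
    if norm_q == 0 or norm_d == 0:
        return 0.0  # assumed convention: an empty vector matches nothing
    return dot / (norm_q * norm_d)
```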

Rocchio algorithm for vector-space model
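$$\vec{q}_m = \alpha\,\vec{q}_0 + \beta\,\frac{1}{|D_r|}\sum_{\vec{d}_j \in D_r}\vec{d}_j - \gamma\,\frac{1}{|D_{nr}|}\sum_{\vec{d}_j \in D_{nr}}\vec{d}_j$$

• q_m: refined query vector
• q_0: original query vector
• D_r: set of relevant documents; D_nr: set of nonrelevant documents
• α, β, γ: weights attached to each term of the formula
• The β term uses the average relevant-document vector
• The γ term uses the average nonrelevant-document vector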
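The Rocchio update moves the original query vector toward the average relevant-document vector and away from the average nonrelevant-document vector. The sketch below assumes documents and queries are equal-length weight lists; the default weights and the clipping of negative weights to zero are common conventions adopted here as assumptions, not values taken from the slides.

```python
def centroid(docs, n):
    """Average vector of a list of n-dimensional document vectors."""
    if not docs:
        return [0.0] * n
    return [sum(d[i] for d in docs) / len(docs) for i in range(n)]

def rocchio(q0, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """q_m = alpha*q0 + beta*centroid(D_r) - gamma*centroid(D_nr)."""
    n = len(q0)
    c_rel = centroid(relevant, n)
    c_non = centroid(nonrelevant, n)
    qm = [alpha * q0[i] + beta * c_rel[i] - gamma * c_non[i] for i in range(n)]
    # Assumed convention: negative term weights are set to zero.
    return [max(0.0, w) for w in qm]

refined = rocchio(
    [1.0, 0.0, 0.0],
    relevant=[[1.0, 1.0, 0.0], [1.0, 0.0, 0.0]],
    nonrelevant=[[0.0, 0.0, 1.0]],
    alpha=1.0, beta=0.5, gamma=0.5,
)
```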

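Global analysis (1)
• Uses all documents to compute the similarity between the query q and the terms in the documents
• Based on a similarity thesaurus

Term-document weight matrix:

$$
\begin{array}{c|cccc}
 & d_1 & d_2 & \cdots & d_n \\ \hline
t_1 & w_{11} & w_{12} & \cdots & w_{1n} \\
t_2 & w_{21} & w_{22} & \cdots & w_{2n} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
t_n & w_{n1} & w_{n2} & \cdots & w_{nn}
\end{array}
$$

• w_ij: weight of term t_i and document d_j
• t_i: term vector (row i of the matrix)
• q = {k_1, k_2, …, k_m}, k_s ∈ t (each query keyword is one of the terms)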

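Global analysis (2)
• Select the r terms with the highest sim values and add them to the initial query to formulate the new query

Similarity of terms:

$$c_{u,v} = \vec{t}_u \cdot \vec{t}_v = \sum_{d_j} w_{u,j}\, w_{v,j}$$

Query vector:

$$\vec{q} = \sum_{t_i \in q} w_{i,q}\, \vec{t}_i$$

Similarity of query and terms, where k_j is a candidate term with term vector t_j:

$$\mathrm{sim}(q, k_j) = \vec{q} \cdot \vec{k}_j = \sum_{t_i \in q} w_{i,q}\,(\vec{t}_i \cdot \vec{t}_j) = \sum_{t_i \in q} w_{i,q}\, c_{i,j}$$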
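A similarity-thesaurus expansion can be sketched as follows: term-term similarities are dot products of rows of the term-document weight matrix, each candidate term is scored against the weighted query terms, and the r best candidates are returned. The matrix layout (rows are terms, columns are documents), the function names, and the toy data are assumptions for illustration.

```python
def term_similarity(W):
    """c[u][v] = sum over documents j of W[u][j] * W[v][j]."""
    n = len(W)
    return [[sum(a * b for a, b in zip(W[u], W[v])) for v in range(n)]
            for u in range(n)]

def expand_query(terms, W, query_weights, r):
    """Return the r non-query terms most similar to the query.

    query_weights maps each query term to its weight w_{i,q}.
    """
    index = {t: i for i, t in enumerate(terms)}
    c = term_similarity(W)
    scores = {}
    for j, term in enumerate(terms):
        if term in query_weights:
            continue
        scores[term] = sum(w * c[index[k]][j] for k, w in query_weights.items())
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:r]

terms = ["car", "auto", "banana"]
W = [[1.0, 1.0, 0.0],   # car
     [1.0, 1.0, 0.0],   # auto
     [0.0, 0.0, 1.0]]   # banana
added = expand_query(terms, W, {"car": 1.0}, r=1)
```

With this toy matrix, "auto" shares both documents with "car" and is the term added to the query.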

Local analysis
• Local analysis: uses the initial query results (especially the top-ranked documents, the "local documents") to refine the query
• Local clustering
  • Cluster the terms of the local documents
  • Refine the query with the relevant cluster
  • Based on the similarity between terms in the query and terms in the documents
• Local context analysis (LCA)
  • Expand the query q with the terms of the local documents that are most similar to it
  • Based on the similarity between q and the terms in the documents
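Local analysis can be illustrated with a deliberately simplified stand-in: terms from the top-ranked documents are scored by how often they co-occur with query terms, and the best ones are added. This is not the exact local clustering or local context analysis formula; the scoring rule, the whitespace tokenization, and the names are assumptions.

```python
from collections import Counter

def local_expansion(query_terms, ranked_docs, top_k=2, r=2):
    """Pick r expansion terms from the top_k documents of the initial result."""
    query = set(query_terms)
    scores = Counter()
    for doc in ranked_docs[:top_k]:
        tokens = doc.lower().split()
        hits = sum(1 for tok in tokens if tok in query)
        if hits == 0:
            continue  # document does not mention the query at all
        for tok in set(tokens) - query:
            scores[tok] += hits
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:r]

docs = ["xml keyword search", "xml keyword query", "banana bread"]
suggested = local_expansion(["xml"], docs, top_k=2, r=1)
```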

XML Keyword Query Refinement

XML Refinement Manner (1)
Form of the refined query
• Keyword query → new keyword query
  • Treat as a traditional IR problem
  • IR with XML keyword search semantics
• Keyword query → structured query
User participation
• Manual (user interactive)
  • Structural feedback
• Automatic

XML Refinement Manner (2)
• Manual, refined to a new keyword query: IR (considering the structure of XML)
• Manual, transformed to a structured query: relevance feedback
• Automatic, refined to a new keyword query: Lu Jiaheng
• Automatic, transformed to a structured query: NLP

Automatic Refinement to a New Keyword Query (1)
• Query → refined query, rule based
• Operations
  • Term merging: $k_1, k_2, \dots, k_n \rightarrow k$
  • Term splitting: $k \rightarrow k'_1, k'_2, \dots, k'_n$
  • Term substitution: $k_1, k_2, \dots, k_n \rightarrow k'_1, k_2, \dots, k_n$
  • Term deletion
Original query          Refined query
IR, 2003, Mike          Information Retrieval, 2003, Mike
Mike, publication       Mike, publications
Database, paper         Database, in-proceedings
XML, John, 2003         XML, John
machin, learn           machine, learning
Hobby, news, paper      Hobby, newspaper
On, line, data, base    Online, database
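The four rule-based operations can be sketched as small list transformations. The helper names and the position-based interface are hypothetical; how a real system decides which operation to apply (dictionary, thesaurus, data statistics) is outside this sketch.

```python
def merge_terms(query, start, end):
    """Merge the adjacent terms query[start..end] (inclusive) into one term."""
    merged = "".join(query[start:end + 1])
    return query[:start] + [merged] + query[end + 1:]

def split_term(query, i, parts):
    """Replace term i by the given parts."""
    return query[:i] + list(parts) + query[i + 1:]

def substitute_term(query, i, replacement):
    """Replace term i by another term."""
    return query[:i] + [replacement] + query[i + 1:]

def delete_term(query, i):
    """Remove term i from the query."""
    return query[:i] + query[i + 1:]

merged = merge_terms(["hobby", "news", "paper"], 1, 2)
fixed = substitute_term(substitute_term(["machin", "learn"], 0, "machine"), 1, "learning")
shorter = delete_term(["XML", "John", "2003"], 2)
```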

Automatic Refinement to a New Keyword Query (2)
• Ranking: the set of refined query candidates S(RQ)
• Refinement cost
  • Cost: the number of operation ("op") steps needed to go from Q to RQ
  • Computed by dynamic programming
• Efficient refinement algorithms
  • Avoid multiple scans of the inverted lists
  • Stack-based and short-list-eager approaches
• RQ candidates can have the same refinement cost
  • Q = {XML, Jim, 2001} → {XML, 2001}, {Jim, 2001} or {XML, Jim}
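The tie in refinement cost is easy to reproduce if every deletion is assumed to cost one step. The sketch below enumerates only deletion-based candidates and is not the dynamic-programming algorithm itself; the unit cost and the function name are assumptions.

```python
from itertools import combinations

def deletion_candidates(query, max_cost):
    """List (cost, refined_query) pairs obtained by deleting up to max_cost terms."""
    candidates = []
    for cost in range(1, max_cost + 1):
        if cost >= len(query):
            break  # never delete every term
        for kept in combinations(query, len(query) - cost):
            candidates.append((cost, list(kept)))
    return candidates

ties = deletion_candidates(["XML", "Jim", "2001"], max_cost=1)
```

All three candidates come back with cost 1, so a further ranking criterion is needed to choose among them.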

NLPX
• Natural language query (NLQ) → NEXI
• NEXI (Narrowed Extended XPath I)
  • //A[about(//B,C)]
  • A: path expression; B: path expression relative to A; C: the content requirement
  • Each 'about' clause represents an individual information request

NLPX: Lexical and Semantic Tagging
• Structural words (XST): name the elements of the path expression
• Boundary words (XBD): separate the structure from the content requirements
• Instruction words (XIN): R: return request, S: support request
Query: Find sections about compression in articles about information retrieval
Tagged: Find/XIN sections/XST about/XBD compression/NN in/IN articles/XST about/XBD information/NN retrieval/NN

NLPX: Template Matching
• Most queries correspond to a small set of patterns
• Grammar templates are formulated from these patterns

Grammar templates:
Query: Request+
Request: CO_Request | CAS_Request
CO_Request: NounPhrase+
CAS_Request: SupportRequest | ReturnRequest
SupportRequest: Structure [Bound] NounPhrase+
ReturnRequest: Instruction Structure [Bound] NounPhrase+

Information requests:
               Request 1       Request 2
Structural:    /article/sec    /article
Content:       compression     information retrieval
Instruction:   R               S

NLPX: NEXI Query Production
• Merge the information requests into a NEXI query
• A[about(.,C)], where A is the structural attribute of the request and C is its content attribute
• //article[about(.,information retrieval)]//sec[about(.,compression)]
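Merging the information requests into one NEXI query can be sketched as below. The sketch assumes every request lies on a single root-to-leaf path and that the deepest request is the return request; the `(structure, content)` pair representation and the function name are hypothetical.

```python
def to_nexi(requests):
    """Build a NEXI query from (structure_path, content) pairs on one path."""
    ordered = sorted(requests, key=lambda req: req[0].count("/"))
    query = ""
    prefix = ""
    for structure, content in ordered:
        if prefix and not structure.startswith(prefix + "/"):
            raise ValueError("requests must lie on one path")
        relative = structure[len(prefix):]          # e.g. "/sec"
        query += "/" + relative + "[about(.," + content + ")]"
        prefix = structure
    return query

nexi = to_nexi([
    ("/article/sec", "compression"),            # return request
    ("/article", "information retrieval"),      # support request
])
```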

Query generation process
• Create target components: break up the query into units
• Generate initial targets: combinations of the input target components
• Generate queries: modify a target component or combine two components

Initialization
• Break up the input query into terms
  • Structure terms (XML tags or attributes)
  • Content terms (referring to text)
• Create components
  • Structure term → unbound target
  • Content term → bound target
• Probability enumeration

Target component and target sets
Query: Papers by jennifer widom

Target components for "jennifer widom":
{//author[~'jennifer widom']}    0.6842
{//editor[~'jennifer widom']}    0.3150
{//title[~'jennifer widom']}     0.0004

Target components for "papers":
{//article}          0.5000
{//inproceedings}    0.5000

Target sets:
{//article} {//author[~'jennifer widom']}          0.3421
{//inproceedings} {//author[~'jennifer widom']}    0.3421
{//inproceedings} {//editor[~'jennifer widom']}    0.1577
{//article} {//editor[~'jennifer widom']}          0.1577
{//inproceedings} {//title[~'jennifer widom']}     0.0002
{//article} {//title[~'jennifer widom']}           0.0002
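The target-set scores are consistent with multiplying the probabilities of the combined components (for example 0.5000 × 0.6842 = 0.3421). The sketch below assumes that product rule, which the slides do not state explicitly; with the listed inputs the editor combinations come out at about 0.1575, close to the listed 0.1577.

```python
from itertools import product

def target_sets(*component_lists):
    """Combine one component from each list; score = product of probabilities."""
    sets = []
    for combo in product(*component_lists):
        prob = 1.0
        for _, p in combo:
            prob *= p
        sets.append(([target for target, _ in combo], prob))
    sets.sort(key=lambda s: -s[1])  # most probable target set first
    return sets

papers = [("//article", 0.5000), ("//inproceedings", 0.5000)]
person = [
    ("//author[~'jennifer widom']", 0.6842),
    ("//editor[~'jennifer widom']", 0.3150),
    ("//title[~'jennifer widom']", 0.0004),
]
ranked_sets = target_sets(papers, person)
```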

Transformation Operators(1)
• Aggregation: merge targets with the same tag
  • {//a}, {//a[~'x']} → {//a[~'x']}
  • {//a[~'x']}, {//a[~'y']} → {//a[~'x y']}
• Prefix expansion: add an ancestor condition
  • {//b} → {//a//b}
  • {//b[~'x']} → {//a//b[~'x']}
• Ordering: combine targets
  • {//a}, {//b} → {//a//b} or {//a[//b]}
  • {//a}, {//b[~'x']} → {//a//b[~'x']} or {//a[//b[~'x']]}
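The three operators can be sketched on a simple target representation. Here a target is assumed to be a `(tag, content)` pair with `content` set to `None` for an unbound target; this representation and the function names are hypothetical and cover only the single-tag cases listed above.

```python
def render(target):
    """Render a (tag, content) target as a path expression."""
    tag, content = target
    return f"//{tag}[~'{content}']" if content else f"//{tag}"

def aggregate(t1, t2):
    """Merge two targets that share the same tag."""
    if t1[0] != t2[0]:
        raise ValueError("aggregation needs targets with the same tag")
    words = " ".join(c for c in (t1[1], t2[1]) if c)
    return (t1[0], words or None)

def prefix_expand(ancestor_tag, target):
    """Add an ancestor condition in front of a target."""
    return f"//{ancestor_tag}{render(target)}"

def order(t1, t2):
    """Combine two targets as a path step or as a branch predicate."""
    return [f"{render(t1)}{render(t2)}", f"{render(t1)}[{render(t2)}]"]
```

For instance, `order(("a", None), ("b", "x"))` yields the two alternatives shown in the last bullet above.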

Conclusion
• Two strong assumptions
  • Keyword queries are unambiguous
  • An XML thesaurus is available
• Accuracy: term classification does not consider the specific XML context
• Time cost
  • Term classification
  • Creating targets requires scanning the XML documents

My work
