Information Retrieval in Context
Post on 22-Feb-2016
04/22/23 Xuehua Shen @CS, UIUC 1
Information Retrieval in Context
Presenter: Xuehua Shen, xshen@uiuc.edu
Presentation Layout
• Problem Description
• Terminology
• Challenges
• IntelliZap System [WWW2001]
• Concerns
Problem
• Search engines have become a key source of information
  • 1998 [GVU WWW Study]: 85% of people use search engines to locate information
  • Now [Craig’s Talk]: 500 million searches on the Internet per day
  • 150 million searches at Google per day
• Efforts focus on coverage and relevance
Web Search Facts
• Given: 3-5 billion web pages on the Web
  • Huge and diverse information provided by the Web
• On average, 1.7 words per query [Eric Brewer, CACM 09/2002]
  • Little information provided by users
• Can search engines retrieve web pages very well?
Context
• Context may provide extra information that helps improve search result relevance
• An example: searching for "flowers" [DirectHit 1999]
  • Men typically want sites that let them send flowers
  • Women often want sites that let them order flower seeds or plants for gardening purposes
• What context information is useful?
Terminology
• Ephemeral context
  • Within a single search session
  • Category [Inquirus2], document being viewed [Watson], feedback
• Persistent context
  • Accumulated incrementally over time and used in subsequent sessions
  • User profile [My Yahoo!], query history and clickthrough data [Google]
Terminology cont.
• Personalization
  • The search engine uses context information to provide different search results for different users
• Customization
  • Users manually configure their preferences
Challenges
• How to capture and store useful information?
  • SearchPad [WWW2001]:
    • Server-proxy-client architecture
    • Users explicitly mark relevant pages
    • Any shortcomings? Better ways?
Challenges cont.
• There are many retrieval models and many user models, but how to merge them?
  • Croft uses a language model to represent context
Challenges cont.
• How to build such a system, for example its architecture?
  • Server side or client side? User interface?
  • Server side: scalability, privacy
  • Client side: communication of context information with the server
Challenges cont.
• How to evaluate such work?
  • Metrics?
  • HARD (High Accuracy Retrieval from Documents) track added at TREC this year
    • Leverages additional information about the searcher and/or the search context
IntelliZap: General Description
• Assumption: a large fraction of searches originate while users are reading documents on their computers
• Standpoint: context is the body of words surrounding a user-selected phrase
• IntelliZap system: a meta search engine with context-based query augmentation, search engine selection, and reranking
Walkthrough of IntelliZap (five screenshot slides; images not preserved in this transcript)
How to Use Context
• Augment the query before sending it to the search engines
• Rerank the results returned by the search engines
How to Collect the Right Amount of Context
• Do not include the whole document, as the Watson system does
• Heuristic 1: establish the optimal context length as a function of the length of the text phrase and individual frequencies
• Heuristic 2: relative weighting of the text and the context in the augmented query
  • Emphasize the marked text phrase
  • Weight of a context word: a monotonic function of its proximity to the text
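Heuristic 2 can be illustrated with a small sketch. It is not IntelliZap's actual weighting: the window size, the weight given to the marked phrase, and the 1/(1+distance) decay are all hypothetical choices, picked only because the decay is monotonic in proximity. Heuristic 1 (choosing the context length) is simplified here to a fixed window.

```python
def weight_context(tokens, sel_start, sel_end, window=5, selected_weight=2.0):
    """Return (word, weight) pairs for an augmented query.

    tokens: the document as a list of words.
    sel_start, sel_end: half-open index range of the user-marked phrase.
    window: hypothetical number of context words taken on each side.
    selected_weight: hypothetical weight that emphasizes the marked phrase.
    """
    weighted = []
    lo = max(0, sel_start - window)
    hi = min(len(tokens), sel_end + window)
    for i in range(lo, hi):
        if sel_start <= i < sel_end:
            weighted.append((tokens[i], selected_weight))
        else:
            # Distance in words to the nearest edge of the marked phrase.
            dist = sel_start - i if i < sel_start else i - sel_end + 1
            # Assumed decay: closer context words get larger weights.
            weighted.append((tokens[i], 1.0 / (1.0 + dist)))
    return weighted
```

With the defaults, a context word adjacent to the marked phrase gets weight 0.5, so the marked phrase (weight 2.0) always dominates the augmented query.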
Algorithm Overview
Step 0: Semantic Network
• Build the semantic network offline: a statistics-based semantic network
• Linear combination of a vector-based correlation metric and a WordNet-based metric
Semantic Network cont.
• Vector-based correlation metric:
  • 27 knowledge domains (computer, business, etc.)
  • 10,000 sample documents from the Internet
  • Each word: a 27-dimensional vector
  • Use correlation to measure distance
• WordNet: captures semantic relations between words (hypernymy, hyponymy, meronymy, and holonymy)
  • WordNet: http://www.cogsci.princeton.edu/~wn/
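The linear combination of the two metrics might look roughly like the sketch below. It assumes Pearson correlation as the correlation measure, a WordNet-based score that has already been computed and scaled to [0, 1], and a hypothetical mixing weight `alpha`; none of these details are given on the slides. The function works for vectors of any length, including the 27-dimensional domain vectors.

```python
import math

def correlation(u, v):
    """Pearson correlation between two equal-length domain vectors."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    if su == 0 or sv == 0:
        return 0.0  # a constant vector carries no domain signal
    return cov / (su * sv)

def relatedness(vec_a, vec_b, wordnet_score, alpha=0.5):
    """Linear combination of the vector-based and WordNet-based metrics.

    wordnet_score: precomputed WordNet-based score (assumed input).
    alpha: hypothetical mixing weight between the two metrics.
    """
    return alpha * correlation(vec_a, vec_b) + (1 - alpha) * wordnet_score
```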
Step 1: Query Augmentation
• Extract keywords from the context surrounding the user-selected text, using the semantic network
  • Typical context: about 50 words
• Use a clustering algorithm to construct several queries on different topics
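A minimal sketch of building several topic-specific queries follows. Grouping each context word by its strongest domain is a deliberately simple stand-in for the clustering step, not the algorithm used by IntelliZap; the `context_vectors` input and the `max_terms` cap are hypothetical.

```python
from collections import defaultdict

def build_queries(selected_text, context_vectors, max_terms=3):
    """Group context words by topic and build one query per group.

    context_vectors: dict mapping a context word to its domain vector
        (hypothetical input; the slides use 27 domains).
    max_terms: hypothetical cap on context words added per query.
    """
    groups = defaultdict(list)
    for word, vec in context_vectors.items():
        # Stand-in for clustering: assign the word to its strongest domain.
        top_domain = max(range(len(vec)), key=lambda d: vec[d])
        groups[top_domain].append((vec[top_domain], word))
    queries = []
    for domain in sorted(groups):
        # Keep the words most strongly tied to this domain.
        best = sorted(groups[domain], reverse=True)[:max_terms]
        terms = [word for _, word in best]
        queries.append(" ".join([selected_text] + terms))
    return queries
```

For an ambiguous selection such as "jaguar", context words about cars and context words about wildlife would end up in separate queries, each sent to the search engines on its own.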
Step 2: Search Engine Selection
• IntelliZap is a meta search engine
• Several general search engines (such as Google, AltaVista)
• For several domains, specific search engines (such as WebMD, FindLaw) are assigned a priori
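The a priori assignment can be pictured as a lookup table. In this sketch the engine names come from the slide, while the domain labels ("medicine", "law") and the function name are assumptions made for illustration.

```python
# General-purpose engines queried for every request.
GENERAL_ENGINES = ["Google", "AltaVista"]

# A priori assignment of domain-specific engines.
# The domain labels are hypothetical; only the engine names are from the slide.
DOMAIN_ENGINES = {
    "medicine": ["WebMD"],
    "law": ["FindLaw"],
}

def select_engines(query_domain):
    """Return the general engines plus any engines assigned to the domain."""
    return GENERAL_ENGINES + DOMAIN_ENGINES.get(query_domain, [])
```

A query whose domain has no assigned engine simply falls back to the general engines.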
Step 3: Results Reranking
• Several search engines return several lists of results
• Use the semantic network to calculate the distance between result titles/summaries and the text/context
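The merge-and-rerank step could be sketched as below. The `overlap` function is only a stand-in (Jaccard overlap of word sets) for the semantic-network distance, which the slides do not specify in detail; the `(url, summary)` result format is also an assumption.

```python
def overlap(text_a, text_b):
    """Stand-in similarity: Jaccard overlap of lowercase word sets."""
    a, b = set(text_a.lower().split()), set(text_b.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

def rerank(result_lists, context, similarity=overlap):
    """Merge result lists and order them by similarity to the context.

    result_lists: one list per engine of (url, summary) pairs (assumed format).
    similarity: pluggable scorer; a semantic-network metric would go here.
    """
    merged = {}
    for results in result_lists:
        for url, summary in results:
            merged.setdefault(url, summary)  # drop duplicate URLs
    scored = [(similarity(summary, context), url) for url, summary in merged.items()]
    scored.sort(key=lambda pair: -pair[0])  # stable: ties keep merge order
    return [url for _, url in scored]
```

Passing the scorer as a parameter keeps the merging logic independent of how semantic distance is actually computed.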
Evaluation Method
• State of the art: no benchmark is available
• Use subjects recruited by an external agency
• Subjects do not know the objective of the experiments; they are just asked to search and evaluate the results
Experiment Results (two slides of figures; not preserved in this transcript)
Concerns
• Privacy and security
  • My Yahoo! holds a database with information on millions of users
  • Users can be monitored through the queries they send!
• Relevance consistency
• Communication problem
End
Thank you!
Backup Slides
Web Statistics
• "Accessibility of Information on the Web", Steve Lawrence, Nature, 1999
Semantic Relations
• Hypernymy: the semantic relation of being superordinate or belonging to a higher rank or class
  • Synonym: superordination
• Hyponymy: the semantic relation of being subordinate or belonging to a lower rank or class
  • Synonym: subordination
• Meronymy: the semantic relation that holds between a part and the whole
  • Synonym: part to whole relation
• Holonymy: the semantic relation that holds between a whole and its parts
  • Synonym: whole to part relation
• More at http://dictionary.metor.com/wnet/
Clustering Algorithm
• Traditional clustering algorithms do not work because of a large amount of noise and a small amount of available information
  • 50 context words represented in a 27-dimensional space
• Special clustering algorithm: high-dimensional clustering
  • Perform recurrent clustering analysis (averaging over iterations)
  • Refine results statistically
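The slides do not describe the recurrent clustering in detail. As a loosely related illustration of "averaging over iterations", the sketch below runs plain k-means several times from random starts and records how often each pair of points lands in the same cluster. This co-association averaging is a swapped-in, generic technique, not the IntelliZap algorithm; `k`, `runs`, `steps`, and `seed` are hypothetical parameters.

```python
import random

def kmeans_labels(points, k, rng, steps=10):
    """Plain k-means from a random start; returns one cluster label per point."""
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(steps):
        # Assignment step: nearest center by squared Euclidean distance.
        for i, p in enumerate(points):
            labels[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])),
            )
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster empties
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels

def co_association(points, k=2, runs=20, seed=0):
    """Fraction of runs in which each pair of points shares a cluster."""
    rng = random.Random(seed)
    n = len(points)
    counts = [[0] * n for _ in range(n)]
    for _ in range(runs):
        labels = kmeans_labels(points, k, rng)
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return [[c / runs for c in row] for row in counts]
```

Pairs with a high averaged value are grouped reliably across runs, which is one simple way to make clustering less sensitive to noise in any single run.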
Limitations of the Web
• Freshness
• Coverage (only the publicly indexable web)
• Bias (sites are not indexed equally)
Several Systems--1
• Inquirus2: meta search engine
• Watson Project (Jay Budzik, NWU): uses the contents of full documents being edited in MS Word or viewed in Explorer
• Remembrance Agent (Bradley Rhodes, MIT): software agent for just-in-time information retrieval
Several Systems--2
• Outride (renamed in 2001)
• GroupFire (spin-off from Xerox PARC) in 2000
Reference
[1] Graphics, Visualization and Usability Center, GVU’s 10th WWW User Survey, 1998