1
Overview of Information Retrieval and our Solutions
Qiang Yang
Department of Computer Science and Engineering
The Hong Kong University of Science and Technology
Hong Kong
2
Why Need Information Retrieval (IR)?
More and more online information in general (Information Overload)
Many tasks rely on effective management and exploitation of information
Textual information plays an important role in our lives
Effective text management directly improves productivity
3
What is IR?
Narrow-sense: IR = search-engine technologies (Google/Yahoo!/Live Search); IR = text matching/classification
Broad-sense: IR = text information management:
How to find useful information? (information retrieval; e.g., Yahoo!)
How to organize information? (text classification; e.g., automatically assign email to different folders)
How to discover knowledge from text? (text mining; e.g., discover correlations of events)
4
Difficulties
Huge amount of online data
Yahoo! had nearly 20 billion pages in its index (as of the beginning of 2005)
Different types of data: Web pages, emails, blogs, chat-room messages
Ambiguous queries
Short: 2-4 words; ambiguous: apple; bank…
5
Our Solutions
Query Classification: champion of KDDCUP'05; TOIS (Vol. 24); SIGIR'06; SIGKDD Explorations (Vol. 7)
Query Expansion/Suggestion: submissions to SIGIR'07; AAAI'07; KDD'07
Entity Resolution: submission to SIGIR'07
Web-page Classification/Clustering: SIGIR'04; CIKM'04; ICDM'04; ICDE'06; WWW'06; IPM (2007); DMKD (Vol. 12)
Document Summarization: SIGIR'05; IJCAI'07
Analysis of Blogs, Emails, Chat-room Messages: SIGIR'06; ICDM'06 (2); IJCAI'07
6
Outline
Query Classification (QC)
Introduction
Solution 1: Query/category enrichment
Solution 2: Bridging classifiers
Entity Resolution
Summary of other works
7
Query Classification
8
Introduction
Web queries are difficult to manage: short; ambiguous; evolving
Query Classification (QC) can help understand queries better: vertical search; re-ranking search results; online advertisements
Difficulties of QC (different from text classification): how to represent queries; the target taxonomy is dynamic (e.g., an online-ads taxonomy); training data is difficult to collect
9
Problem Definition
Inspired by the KDDCUP'05 competition
Classify a query into a ranked list of categories
Queries are collected from real search engines
Target categories are organized in a tree, with each node being a category
10
Related Work
Document classification
Feature selection [Yang et al. 1997]; feature generation [Cai et al. 2003]
Classification algorithms: Naïve Bayes [McCallum and Nigam 1998], KNN [Yang 1999], SVM [Joachims 1999], …
An overall survey in [Sebastiani 2002]
11
Related Work: Query Classification/Clustering
Classifying Web queries by geographical locality [Gravano 2003]
Classifying queries according to their functional types [Kang 2003]
Beitzel et al. studied topical classification as we do, but with manually classified data [Beitzel 2005]
Beeferman and Wen worked on query clustering using clickthrough data, respectively [Beeferman 2000; Wen 2001]
12
Related Work: Document/Query Expansion
Borrowing text from extra data sources: using hyperlinks [Glover 2002]; using implicit links from query logs [Shen 2006]; using existing taxonomies [Gabrilovich 2005]
Query expansion [Manning 2007]: global methods, independent of the queries; local methods using relevance feedback or pseudo-relevance feedback
13
Solutions
[Diagram: queries are connected to the target categories by Solution 1 (Query/Category Enrichment) and Solution 2 (Bridging classifier).]
Solution 1: Query/Category Enrichment
Solution 2: Bridging classifier
Solution 1: Query/Category Enrichment
14
Solution 1: Query/Category Enrichment
Assumptions & Architecture Query Enrichment Classifiers
Synonym-based classifiers Statistical classifiers
Experiments
15
Assumptions & Architecture
The intended meanings of Web queries should be reflected by the Web
A set of objects exists that covers the target categories
[Architecture diagram: Phase I (the training phase) constructs the synonym-based classifiers and the statistical classifier; Phase II (the testing phase) sends the query to a search engine, classifies the labels and the text of the returned pages, and combines the classified results into the final result.]
16
[Diagram: query enrichment through search results, collecting textual information (title, snippet, full text) and category information.]
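The enrichment step can be sketched as follows; `search` is a hypothetical function standing in for a real search-engine API, returning (title, snippet, category) tuples:

```python
def enrich_query(query, search):
    """Enrich a short query with text from its search results.

    `search` is a hypothetical callable: query -> list of
    (title, snippet, category) tuples. Real systems would call
    an actual search engine here.
    """
    results = search(query)
    # Textual information: titles and snippets concatenated
    text = " ".join(f"{title} {snippet}" for title, snippet, _ in results)
    # Category information from the returned pages, when present
    categories = [cat for _, _, cat in results if cat]
    return text, categories
```

The enriched text and category labels then feed the two kinds of classifiers described on the following slides.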
17
Synonym-based classifiers
[Diagram: a query retrieves Pages 1-4; each page's intermediate-taxonomy categories (C^I) are mapped to target categories (C^T) to produce the result C*.]
18
Synonym-based classifiers: Map by Word Matching
Direct matching: high precision, low recall
Extended matching via WordNet: e.g., "Hardware" → "Hardware; Device; Equipment"
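The two matching modes can be sketched as follows; `SYNONYMS` is a toy stand-in for WordNet lookups, not the actual resource used:

```python
# Toy stand-in for WordNet synonym expansion.
SYNONYMS = {"hardware": {"hardware", "device", "equipment"}}

def direct_match(intermediate_label, target_label):
    """Direct matching: exact (case-insensitive) label equality."""
    return intermediate_label.lower() == target_label.lower()

def extended_match(intermediate_label, target_label):
    """Extended matching: accept any synonym of the target label."""
    expanded = SYNONYMS.get(target_label.lower(), {target_label.lower()})
    return intermediate_label.lower() in expanded
```

Direct matching misses "Device" for the target "Hardware"; extended matching catches it, trading some precision for recall.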
19
Statistical classifiers: SVM
Apply the synonym-based classifiers to map Web pages from the intermediate taxonomy to the target taxonomy
Obtain <page, target category> pairs as the training data
Train SVM classifiers for the target categories
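A minimal sketch of this training step, assuming scikit-learn; the page texts and mapped labels below are toy stand-ins for the real <page, target category> pairs:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def train_query_classifier(page_texts, mapped_labels):
    """Train an SVM on pages whose target-category labels came from
    the synonym-based mapping (a sketch, not the paper's exact setup)."""
    clf = make_pipeline(TfidfVectorizer(), LinearSVC())
    clf.fit(page_texts, mapped_labels)
    return clf
```

At query time, the enriched query text is fed to the trained classifier to obtain target categories.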
20
Statistical Classifier: SVM
Advantages
[Diagram: circles (triangles) denote crawled pages; the black ones are mapped to the two categories successfully, while the white ones fail to map.]
If a query happens to be represented by the white pages, it cannot be classified correctly by the synonym-based method, but SVM can classify it
Disadvantages
Recall can be higher, but precision may suffer
Once the target taxonomy changes, the classifiers must be retrained
21
Putting Them Together: Ensemble of Classifiers
Why an ensemble? The two kinds of classifiers are based on different mechanisms; they can be complementary to each other; a proper combination can improve performance
Combination strategies: EV (use validation data); EN (no validation data)
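A minimal sketch of a linear score combination; the weight `alpha` would be tuned on validation data under EV and fixed (e.g., 0.5) under EN. This is an illustration, not the paper's exact combination rule:

```python
def combine_scores(scores_a, scores_b, alpha=0.5):
    """Linearly combine two classifiers' per-category scores.

    scores_a, scores_b: dicts mapping category -> score.
    alpha: weight on the first classifier; tuned on validation
    data (EV) or left at a default (EN).
    """
    cats = set(scores_a) | set(scores_b)
    return {c: alpha * scores_a.get(c, 0.0) + (1 - alpha) * scores_b.get(c, 0.0)
            for c in cats}
```

Categories are then ranked by the combined score.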
22
Experiment: Data Sets & Evaluation Criteria
Queries: from KDDCUP 2005 (800,000 queries, 800 labeled; three labelers)
Evaluation (against each of the three human labelers i)
A_i: # of queries correctly tagged as c_i
B_i: # of queries tagged as c_i
C_i: # of queries whose category is c_i
Precision = A / B; Recall = A / C
F1 = 2 * Precision * Recall / (Precision + Recall)
Overall F1 = (1/3) * sum of F1 (against human labeler i), for i = 1, 2, 3
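The per-category metrics above can be sketched in Python; `tagged` and `truth` are hypothetical query-to-label mappings, not the KDDCUP data format:

```python
def precision_recall_f1(tagged, truth, category):
    """Per-category precision, recall, and F1 over a set of queries.

    tagged: dict mapping each query to its set of predicted categories
    truth:  dict mapping each query to its set of true categories
    """
    # A: queries correctly tagged as the category
    A = sum(1 for q in tagged
            if category in tagged[q] and category in truth.get(q, set()))
    # B: queries tagged as the category
    B = sum(1 for q in tagged if category in tagged[q])
    # C: queries whose true category is the category
    C = sum(1 for q in truth if category in truth[q])
    p = A / B if B else 0.0
    r = A / C if C else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```

The overall figure averages the resulting F1 against each of the three labelers.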
23
Experiment: Quality of the Data Sets
Consistency between labelers
The distribution of the labels assigned by the three labelers.
Performance of each labeler evaluated against the other labelers
24
Experiment Results: Direct vs. Extended Matching
Number of pages collected for training using different mapping methods
F1 of the synonym-based classifier and SVM
25
Experiment Results: The Number of Assigned Labels
[Figures: precision, recall, and F1 vs. the number of guessed labels (1-6), comparing S1, S2, S3, SVM, EN, and EDP.]
26
Experiment Results: Effect of Base Classifiers
27
Solutions
[Diagram: queries are connected to the target categories by Solution 1 (Query/Category Enrichment) and Solution 2 (Bridging classifier).]
Solution 2: Bridging classifier
28
Solution 2: Bridging Classifiers
Our algorithm: bridging classifier; category selection
Experiments: data set and evaluation criteria; results and analysis
29
Algorithm: Bridging Classifier
Problem with Solution 1: the target taxonomy is fixed, and training must be repeated when it changes
Goal: connect the target taxonomy and queries by taking an intermediate taxonomy as a bridge
30
Algorithm: Bridging Classifier (cont.)
How to connect?
The prior probability of an intermediate category C_j^I
The relation between C_j^I and a target category C_i^T
The relation between C_j^I and the query q
The relation between C_i^T and q
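These quantities combine into the bridging classifier; a reconstruction consistent with the bullets above (C_i^T a target category, C_j^I an intermediate category, q the query) is:

```latex
p(C_i^T \mid q) \;\propto\; \sum_j p(C_i^T \mid C_j^I)\, p(q \mid C_j^I)\, p(C_j^I)
```

Each intermediate category contributes evidence for the target category in proportion to how well it relates to both the target category and the query.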
31
Algorithm: Bridging Classifier (cont.)
Understanding the bridging classifier
Given C_i^T and q: p(C_i^T | C_j^I) and p(q | C_j^I) are fixed
p(C_j^I), which reflects the size of C_j^I, acts as a weighting factor
The product tends to be larger when q and C_i^T tend to belong to the same smaller intermediate categories
32
Algorithm: Category Selection
Category selection for reducing complexity
Total Probability (TP)
Mutual Information (MI)
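A sketch of MI-based selection: score each intermediate category by the mutual information between membership in it and the target category, then keep the top-scoring categories. The joint-count table below is a toy stand-in, and this is a generic MI computation, not necessarily the paper's exact formula:

```python
from math import log2

def mutual_information(joint):
    """MI (in bits) between intermediate-category membership and the
    target category, from a joint count table:
    joint[(in_category, target_label)] = count.
    """
    total = sum(joint.values())
    px, py = {}, {}
    for (x, y), n in joint.items():
        px[x] = px.get(x, 0) + n
        py[y] = py.get(y, 0) + n
    mi = 0.0
    for (x, y), n in joint.items():
        if n:
            mi += (n / total) * log2(n * total / (px[x] * py[y]))
    return mi
```

Intermediate categories with high MI discriminate well between target categories and are retained.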
33
Experiment: Data Sets and Evaluation Criteria
Intermediate taxonomy: ODP, with 1.5M Web pages in 172,565 categories
[Table: number of categories on different levels]
[Table: statistics of the numbers of documents in the categories on different levels]
34
Experiment: Results of Bridging Classifiers
All intermediate categories are used; snippets only
Best result when n = 60
Improvements of 10.4% and 7.1% in precision and F1, respectively, compared to the two previous approaches
35
Experiment: Results of Bridging Classifiers
Best results when using all intermediate categories
Reason: a category with larger granularity may be a mixture of several target categories, so it cannot be used to distinguish different target categories
[Figure: performance of the bridging classifier with different granularities of the intermediate taxonomy]
36
Experiment: Effect of Category Selection
MI works better than TP: it favors the categories that are more powerful for distinguishing the target categories
When the number of categories is around 18,000, the bridging classifier is comparable to, if not better than, the previous approaches
37
Entity Resolution
Definition: Reference & Entity
Tsz-Chiu Au, Dana S. Nau: The Incompleteness of Planning with Volatile External Information. ECAI 2006
Tsz-Chiu Au, Dana S. Nau: Maintaining Cooperation in Noisy Environments. AAAI 2006
[Annotations: name references and venue references in the citations above point to an author entity and a journal/conference entity.]
Current Author Search
DBLP, CiteSeer, Google: all of them return a MIXED list of references
Graphical Model
We convert entity resolution into a graph partition problem
Each node denotes a reference; each edge denotes the relation of two references
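The graph-partition idea can be sketched as follows: link two references whenever their similarity exceeds a threshold, then take connected components as entity clusters. This thresholded-components scheme is a simple stand-in for full graph partitioning, and `similarity` is a hypothetical pairwise scoring function:

```python
def resolve_entities(references, similarity, threshold=0.5):
    """Partition reference nodes into entity clusters via union-find
    over above-threshold similarity edges (a simplified sketch)."""
    parent = {r: r for r in references}

    def find(r):
        # Path-halving union-find lookup
        while parent[r] != r:
            parent[r] = parent[parent[r]]
            r = parent[r]
        return r

    for i, a in enumerate(references):
        for b in references[i + 1:]:
            if similarity(a, b) > threshold:
                parent[find(a)] = find(b)

    clusters = {}
    for r in references:
        clusters.setdefault(find(r), set()).add(r)
    return list(clusters.values())
```

Each resulting cluster collects the references believed to denote one entity.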
How to Measure the Reference Relation
Tsz-Chiu Au, Dana S. Nau: The Incompleteness of Planning with Volatile External Information. ECAI 2006
Ugur Kuter, Dana S. Nau: Using Domain-Configurable Search Control for Probabilistic Planning. AAAI 2005
[Annotations: the two citations share authors/coauthors, a research community/research area, and plain-text similarity.]
Features
F1: title similarity; F2: coauthor similarity; F3: venue similarity; F4: research community overlap; F5: research area overlap
Research community overlap (A1, A2 stand for two author name references):
F4.1: Similarity(A1, A2) = Coauthors(Coauthors(A1)) ∩ Coauthors(Coauthors(A2))
F4.2: Similarity(A1, A2) = Venues(Coauthors(A1)) ∩ Venues(Coauthors(A2))
Coauthors(X) returns the coauthor name set of each author in set X
Venues(Y) returns the venue name set of each author in set Y
Research area overlap (V1, V2 stand for two venue references):
F5.1: Similarity(V1, V2) = Authors(Articles(V1)) ∩ Authors(Articles(V2))
F5.2: Similarity(V1, V2) = Articles(Authors(Articles(V1))) ∩ Articles(Authors(Articles(V2)))
Authors(X) returns the author name set of each article in set X
Articles(Y) returns the article set holding a reference to each element in set Y
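The F4.1 community-overlap feature can be sketched as follows; here `papers` is a hypothetical corpus represented as lists of author names, standing in for the real bibliographic data:

```python
def coauthors(authors, papers):
    """Coauthors(X): the coauthor name set of each author in set X.

    papers: list of author-name lists (a toy stand-in for DBLP records).
    """
    out = set()
    for paper in papers:
        if any(a in paper for a in authors):
            out.update(paper)
    return out - set(authors)

def community_overlap(a1, a2, papers):
    """F4.1-style overlap of second-order coauthor sets (a sketch)."""
    c1 = coauthors(coauthors({a1}, papers), papers)
    c2 = coauthors(coauthors({a2}, papers), papers)
    return c1 & c2
```

A large overlap suggests the two name references belong to the same research community, and hence possibly the same author entity.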
System Framework
[Diagram: pairwise similarity scores are converted into probabilities for the graph partition.]
Experiment Results
Our dataset: 1,000 references to 20 author entities from DBLP
Getoor's datasets: CiteSeer, 2,892 author references to 1,165 author entities; arXiv, 58,515 references to 9,200 author entities
F1 = 97.0%
47
Summary of Other Work
48
Summary of Other Work
Summarization using Conditional Random Fields (IJCAI '07)
Thread Detection in Dynamic Text Message Streams (SIGIR '06)
Implicit Links for Web Page Classification (WWW '06)
Text Classification Improved by Multigram Models (CIKM '06)
Latent Friend Mining from Blog Data (ICDM '06)
Web-page Classification through Summarization (SIGIR '04)
49
Summarization using Conditional Random Fields (IJCAI '07)
Motivation and observation: summarization can be cast as sequence labeling
Solution: CRF, with feature functions and learned parameters
[Diagram: sentences are labeled in three steps; a linear-chain CRF over observed sentences x_t with unobserved labels y_t.]
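The linear-chain CRF behind this labeling can be written in its standard form (the slide's specific feature functions and parameter values are elided, so only the generic model is shown):

```latex
p(\mathbf{y} \mid \mathbf{x}) \;=\; \frac{1}{Z(\mathbf{x})}
\exp\!\Big( \sum_{t} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, \mathbf{x}, t) \Big)
```

Here x is the observed sentence sequence, y the unobserved label sequence (in-summary or not), f_k the feature functions, lambda_k their weights, and Z(x) the normalizer.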
50
Thread Detection in Dynamic Text Message Streams (SIGIR '06)
Representation: content-based; structure-based (sentence type; personal pronouns)
Clustering
51
Implicit Links for Web Page Classification (WWW '06)
Implicit link 1 (LI1)
Assumption: a user tends to click pages related to the issued query
Definition: there is an LI1 between d1 and d2 if they are clicked by the same person through the same query
Implicit link 2 (LI2)
Assumption: users tend to click related pages according to the same query
Definition: there is an LI2 between d1 and d2 if they are clicked according to the same query
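Extracting LI1 links from a clickthrough log can be sketched as follows; the (user, query, page) tuple schema is a hypothetical simplification of a real query log:

```python
def implicit_links_li1(click_log):
    """LI1: link two pages clicked by the same user through the same query.

    click_log: iterable of (user, query, page) tuples (toy schema).
    Returns a set of undirected page pairs, ordered lexicographically.
    """
    by_key = {}
    for user, query, page in click_log:
        by_key.setdefault((user, query), set()).add(page)
    links = set()
    for pages in by_key.values():
        ordered = sorted(pages)
        for i, a in enumerate(ordered):
            for b in ordered[i + 1:]:
                links.add((a, b))
    return links
```

Dropping the user from the grouping key would yield LI2 links instead, since LI2 only requires the same query.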
52
Text Classification Improved by Multigram Models (CIKM ’06)
Training stage: for each category, train an n-multigram model, then train an n-gram model on the segmented sequences
Test stage: for a test document, segment the document for each category, calculate its probability under the corresponding n-gram model, and assign the document the category under which it has the largest probability
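The assign-to-highest-probability step can be sketched with a simplified per-category language model; a smoothed unigram model stands in here for the n-multigram/n-gram models of the paper:

```python
from math import log

def train_unigram(docs):
    """Per-category unigram counts (a simplified stand-in for the
    n-multigram and n-gram models)."""
    counts = {}
    for doc in docs:
        for w in doc.split():
            counts[w] = counts.get(w, 0) + 1
    return counts

def log_prob(model, doc, vocab_size):
    """Add-one-smoothed log probability of a document under a model."""
    total = sum(model.values())
    return sum(log((model.get(w, 0) + 1) / (total + vocab_size))
               for w in doc.split())

def classify(doc, models):
    """Assign the category whose model gives the largest probability."""
    vocab = {w for m in models.values() for w in m}
    return max(models, key=lambda c: log_prob(models[c], doc, len(vocab)))
```

The paper's models additionally segment the document into multigrams per category before scoring; the decision rule is the same.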
53
Latent Friend Mining from Blog Data (ICDM ’06)
Objective: one way to build Web communities
Find the people sharing similar interests with a target person
"Interest" is reflected by their "writings"; "writings" come from their "blogs"
These people may not know each other; they are not linked as in previous studies
54
Latent Friend Mining from Blog Data (cont.)
Solutions
Cosine similarity-based method: calculate the cosine similarity between the contents of the blogs
Topic model-based method: find latent topics in the blogs using latent topic models and calculate similarity at the topic level
Two-level similarity-based method: first stage, use an existing topic hierarchy to get the topic distribution of a blogger's blogs; second stage, use a detailed similarity comparison
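The cosine-similarity baseline can be sketched as follows; blogs are represented as term-frequency dicts, a toy stand-in for the real blog corpus:

```python
from math import sqrt

def cosine(v1, v2):
    """Cosine similarity between two term-frequency dicts."""
    dot = sum(v1[t] * v2.get(t, 0) for t in v1)
    n1 = sqrt(sum(x * x for x in v1.values()))
    n2 = sqrt(sum(x * x for x in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def latent_friends(target, bloggers, k=3):
    """Rank the other bloggers by similarity of their blog term
    vectors to the target blogger's vector."""
    scores = {name: cosine(bloggers[target], vec)
              for name, vec in bloggers.items() if name != target}
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

The topic-model and two-level methods replace the raw term vectors with topic distributions, but rank candidates the same way.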
55
Web-page Classification through Summarization (SIGIR ’04)
[Diagram: a combined summarizer (LUHN, LSA, a supervised summarizer, page-layout analysis, and the page description) produces training summaries and testing summaries; a classifier trained on the training summaries is applied to the testing summaries to yield the result.]
56
Thanks