Filtering Semi-Structured Documents Based on Faceted Feedback
Lanbo Zhang, Yi Zhang, Qianli Xing
Information Retrieval and Knowledge Management (IRKM) Lab, University of California, Santa Cruz

Upload: maya-dillon

Post on 31-Dec-2015



Page 1: Filtering Semi-Structured Documents Based on Faceted Feedback

Filtering Semi-Structured Documents Based on Faceted Feedback

Lanbo Zhang, Yi Zhang, Qianli Xing
Information Retrieval and Knowledge Management (IRKM) Lab
University of California, Santa Cruz

Page 2: Filtering Semi-Structured Documents Based on Faceted Feedback

Personalized Information Filtering

• Identify user-desired documents from a document stream
• Two families of filtering approaches
– Collaborative Filtering (CF)
– Content-Based Filtering (CBF)
• Applications: news feeder, email spam filter, etc.

[Diagram: News, Blogs, Emails → Filtering System → Passed documents]

Page 3: Filtering Semi-Structured Documents Based on Faceted Feedback

Semi-Structured Documents

• Increasingly prevalent over the Internet
• Emails, news, movies, tweets, etc.
• Plenty of metadata available

Page 4: Filtering Semi-Structured Documents Based on Faceted Feedback

Definitions

• Facet: a metadata field
– Date, Topic, Location, Director, Genre, etc.
• Facet-Value Pair (FVP): a metadata field assigned with a particular value
– Topic: Royal wedding
– Date: 04-29-2011
– Location: London, UK

Lanbo Zhang, Yi Zhang, Qianli Xing. IRKM Lab at University of California, Santa Cruz
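The definitions on this slide can be made concrete with a small data-structure sketch. The class name `FacetValuePair` and the helper `has_fvp` are hypothetical names chosen for illustration; the slides do not prescribe any representation.

```python
from typing import NamedTuple, Set


class FacetValuePair(NamedTuple):
    """A metadata field (facet) assigned a particular value."""
    facet: str
    value: str


# Metadata of one semi-structured document, using the slide's examples.
doc_fvps: Set[FacetValuePair] = {
    FacetValuePair("Topic", "Royal wedding"),
    FacetValuePair("Date", "04-29-2011"),
    FacetValuePair("Location", "London, UK"),
}


def has_fvp(fvps: Set[FacetValuePair], facet: str, value: str) -> bool:
    """Return True if the document carries the given facet-value pair."""
    return FacetValuePair(facet, value) in fvps
```

Storing FVPs as a set makes membership tests cheap, which is the operation most filtering rules over metadata would need.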

Page 5: Filtering Semi-Structured Documents Based on Faceted Feedback

Motivation

• Existing filtering approaches learn user interests based on users’ relevance judgments of documents
• Users may have prior knowledge on which facet-value pairs are relevant
– English-only readers
• “Language: English”
– Social network analysts
• “Company: Facebook”

Page 6: Filtering Semi-Structured Documents Based on Faceted Feedback

Can we exploit users’ prior knowledge on facet-value pairs for filtering?


Page 7: Filtering Semi-Structured Documents Based on Faceted Feedback

A New User Interaction Mechanism: Faceted Feedback

[Diagram: the filtering system presents FVP candidates (Lang: …, Topic: …, Date: …); the user returns the relevant FVPs (Topic: …, Lang: …)]

Page 8: Filtering Semi-Structured Documents Based on Faceted Feedback

Research Questions

• Question 1
– How to select facet-value pair candidates?
• Question 2
– How to learn user profiles based on faceted feedback?

Page 9: Filtering Semi-Structured Documents Based on Faceted Feedback

Q1: Possible Methods

• Feature selection methods for text classification
– E.g., Mutual Information, Chi-Square measure, etc.
• Usually a large number of labeled documents available
• Query expansion methods for retrieval
– E.g., TFIDF score on pseudo relevant documents
• No labeled documents available

Page 10: Filtering Semi-Structured Documents Based on Faceted Feedback

FVP Selection: Our Approach

• In a filtering task
– A large number of unlabeled documents
– Possibly a small number of labeled documents
• We rank facet-value pairs by a score over the pseudo relevant (positively classified) documents and the user-labeled relevant documents, combined with $\mathrm{IDF}(f) = \log \frac{N}{N_f}$

[The full ranking formula is not preserved in this transcript.]

Intuition: features that occur frequently among relevant docs while rarely in the whole corpus are very likely to be relevant
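The intuition above can be sketched as a frequency-times-IDF ranking. This is a minimal illustration, not the authors' exact formula, which the transcript does not preserve. It assumes that pseudo relevant and user-labeled relevant documents are simply pooled into one list, that N is the corpus size, and that N_f is the number of corpus documents containing f. The function name `rank_fvps` is hypothetical.

```python
import math
from collections import Counter
from typing import List, Set


def rank_fvps(relevant_docs: List[Set[str]], corpus: List[Set[str]],
              top_k: int = 10) -> List[str]:
    """Rank FVPs by (count among relevant docs) * log(N / N_f)."""
    n_docs = len(corpus)
    doc_freq = Counter(f for doc in corpus for f in doc)         # N_f
    rel_freq = Counter(f for doc in relevant_docs for f in doc)  # count in relevant docs
    scores = {
        f: count * math.log(n_docs / doc_freq[f])
        for f, count in rel_freq.items()
        if doc_freq[f] > 0  # skip FVPs never seen in the corpus
    }
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda f: (-scores[f], f))[:top_k]


corpus = [
    {"Lang: English", "Company: Facebook"},
    {"Lang: English", "Company: Facebook"},
    {"Lang: English"},
    {"Lang: English", "Topic: Sports"},
]
relevant = corpus[:2]  # assumed pooled pseudo relevant + user-labeled relevant docs
ranking = rank_fvps(relevant, corpus)
```

In this toy corpus "Lang: English" occurs in every document, so its IDF is zero and it ranks below "Company: Facebook" even though both appear in all relevant documents.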

Page 11: Filtering Semi-Structured Documents Based on Faceted Feedback

Research Questions

• Question 1
– How to select facet-value pair candidates?
• Question 2
– How to learn user profiles based on faceted feedback?

Page 12: Filtering Semi-Structured Documents Based on Faceted Feedback

Content-Based Filtering (CBF)

• Treated as a binary text classification task
• User profile: a feature vector that represents a user’s information needs (interests/preferences)
• Given the user profile θ, a document can be determined as relevant or not according to:

[Formula not preserved in this transcript; its annotated terms are the document vector and the document label.]

The core of CBF is learning the user profile!
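As an illustration of deciding relevance from a profile vector θ and a document vector, the sketch below assumes a logistic model. The slide's own decision formula is not available in this transcript, so the logistic form, the 0.5 threshold, and the function names are all assumptions.

```python
import math
from typing import Dict


def relevance_probability(profile: Dict[str, float],
                          doc_vector: Dict[str, float]) -> float:
    """Assumed logistic model: P(relevant | doc, theta) = sigmoid(theta . x)."""
    score = sum(profile.get(f, 0.0) * x for f, x in doc_vector.items())
    return 1.0 / (1.0 + math.exp(-score))


def is_relevant(profile: Dict[str, float], doc_vector: Dict[str, float],
                threshold: float = 0.5) -> bool:
    """Deliver the document when the predicted probability reaches the threshold."""
    return relevance_probability(profile, doc_vector) >= threshold


# Hypothetical profile: positive weight on one FVP, negative on another.
theta = {"Company: Facebook": 2.0, "Topic: Sports": -1.0}
```

Features absent from the profile contribute zero weight, so an empty profile yields a probability of exactly 0.5 for every document.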

Page 13: Filtering Semi-Structured Documents Based on Faceted Feedback

Q2: Possible Methods

• Simple methods
– Boolean strategy (AND, OR)
– Feature selection
– Pseudo relevant document
• Sophisticated methods
– Bayesian logistic regression with an adjusted prior (Dayanik et al. 06)
– Generalized Expectation Criteria (Druck et al. 08)
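The Boolean strategy listed among the simple methods can be sketched as follows. This is one plausible reading, assuming AND delivers a document only if it carries every user-selected FVP and OR delivers it if it carries at least one; the function names are hypothetical.

```python
from typing import Set


def bool_and(doc_fvps: Set[str], selected: Set[str]) -> bool:
    """AND strategy: the document must contain all selected FVPs."""
    return selected <= doc_fvps


def bool_or(doc_fvps: Set[str], selected: Set[str]) -> bool:
    """OR strategy: the document must contain at least one selected FVP."""
    return bool(selected & doc_fvps)


selected = {"Lang: English", "Company: Facebook"}
doc = {"Lang: English", "Topic: Sports"}
```

Under this reading AND favors precision and OR favors recall, which is why neither uses the selected FVPs as flexibly as a learned profile would.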

Page 14: Filtering Semi-Structured Documents Based on Faceted Feedback

Our Approach

• The assumption
– A feature is selected by a user since it has a high correlation with the document label (R/NR)
• Generalized Constraint Model (GCM)

Page 15: Filtering Semi-Structured Documents Based on Faceted Feedback

Correlation Decomposition

• Sufficiency
– The probability of a document being relevant given that the feature has occurred: P(R+|f=1)
– P(R+|f=1)=1 : sufficient features
• E.g., “Company: Facebook” for social network analysts
• Necessity
– The probability of the feature having occurred given that a document is relevant: P(f=1|R+)
– P(f=1|R+)=1 : necessary features
• E.g., “Language: English” for English-only readers
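Both quantities can be computed empirically when document labels are known. The sketch below is illustrative only: the toy documents and labels are invented, and the function names are hypothetical.

```python
from typing import List, Set


def sufficiency(docs: List[Set[str]], labels: List[int], feature: str) -> float:
    """Empirical P(R+ | f=1): fraction of docs containing f that are relevant."""
    covered = [y for d, y in zip(docs, labels) if feature in d]
    return sum(covered) / len(covered) if covered else 0.0


def necessity(docs: List[Set[str]], labels: List[int], feature: str) -> float:
    """Empirical P(f=1 | R+): fraction of relevant docs that contain f."""
    relevant = [d for d, y in zip(docs, labels) if y]
    return sum(feature in d for d in relevant) / len(relevant) if relevant else 0.0


# Toy data: 1 = relevant, 0 = non-relevant.
docs = [
    {"Lang: English", "Company: Facebook"},
    {"Lang: English"},
    {"Lang: English"},
    {"Lang: French"},
]
labels = [1, 1, 0, 0]
```

Here "Company: Facebook" is sufficient (every document carrying it is relevant) but not necessary, while "Lang: English" is necessary (every relevant document carries it) but not sufficient.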

Page 16: Filtering Semi-Structured Documents Based on Faceted Feedback

Examples: Highly-Correlated Features

[Diagram: the whole corpus, containing the relevant set R+ and the regions f1=1, f2=1, and f3=1]

1) f1 is a sufficient feature since P(R+|f1=1)=1
2) f2 is a necessary feature since P(f2=1|R+)=1
3) f3 is neither necessary nor sufficient, but both its sufficiency and necessity are high (>0.5)

Page 17: Filtering Semi-Structured Documents Based on Faceted Feedback

Estimating Sufficiency

[Formula not preserved in this transcript. Its annotated terms are: the document label, the feature, the set of documents covered by feature f, the user profile vector, and the estimation of the label of document $d_i$.]
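One way to read the annotations on this slide is that sufficiency is estimated by averaging the model's predicted relevance over the documents covered by feature f. The slide's formula itself is not available here, so this reading is an assumption, as are the logistic scoring function and the names `predict` and `estimated_sufficiency`.

```python
import math
from typing import Dict, List, Set


def predict(profile: Dict[str, float], doc: Set[str]) -> float:
    """Assumed logistic estimate of P(R+ | doc, theta) over binary features."""
    score = sum(profile.get(f, 0.0) for f in doc)
    return 1.0 / (1.0 + math.exp(-score))


def estimated_sufficiency(profile: Dict[str, float], docs: List[Set[str]],
                          feature: str) -> float:
    """Average predicted relevance over the documents that contain the feature."""
    covered = [d for d in docs if feature in d]
    if not covered:
        return 0.0
    return sum(predict(profile, d) for d in covered) / len(covered)


unlabeled = [{"Company: Facebook"}, {"Company: Facebook", "Topic: IPO"}, {"Topic: Sports"}]
```

Because the estimate depends only on the profile and on unlabeled documents, it can be computed without any user relevance judgments.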

Page 18: Filtering Semi-Structured Documents Based on Faceted Feedback

Estimating Necessity

[Formula not preserved in this transcript. Its annotations are: feature sufficiency, Bayes’ Theorem, and prior distribution.]
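Bayes' theorem relates necessity to sufficiency as P(f=1|R+) = P(R+|f=1) · P(f=1) / P(R+). The sketch below applies that identity directly. How the authors obtain the marginals is not stated in the transcript, so passing them in as plain numbers is an assumption, and the function name is hypothetical.

```python
def necessity_from_sufficiency(sufficiency: float, p_feature: float,
                               p_relevant: float) -> float:
    """Bayes' theorem: P(f=1 | R+) = P(R+ | f=1) * P(f=1) / P(R+)."""
    if p_relevant <= 0.0:
        raise ValueError("p_relevant must be positive")
    return sufficiency * p_feature / p_relevant


# Toy numbers: the feature covers 3 of 4 docs, 2 of 4 docs are relevant,
# and 2 of the 3 covered docs are relevant (sufficiency = 2/3).
nec = necessity_from_sufficiency(2 / 3, 3 / 4, 2 / 4)
```

With these numbers the feature occurs in every relevant document, so the result equals 1 up to floating-point rounding.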

Page 19: Filtering Semi-Structured Documents Based on Faceted Feedback

Reference Distributions

• Our assumption
– User selects a feature since it has a high sufficiency and/or a high necessity
• Reference distributions: two Bernoulli dist’ns
– The sufficiency/necessity of a user-selected feature should be close to the reference distribution
– KL-divergence for similarity measure
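The KL-divergence between two Bernoulli distributions has a simple closed form, sketched below. The direction of the divergence (reference first, estimate second), the clamping constant `eps`, and the example reference value 0.9 are assumptions; the slides do not specify them.

```python
import math


def bernoulli_kl(p: float, q: float, eps: float = 1e-12) -> float:
    """KL(Bernoulli(p) || Bernoulli(q)), with clamping to avoid log(0)."""
    p = min(max(p, eps), 1.0 - eps)
    q = min(max(q, eps), 1.0 - eps)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))


reference = 0.9  # hypothetical reference parameter for a user-selected feature
close = bernoulli_kl(reference, 0.8)
far = bernoulli_kl(reference, 0.5)
```

The divergence is zero when the estimated sufficiency or necessity matches the reference and grows as the estimate moves away from it, which is what makes it usable as a penalty.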

Page 20: Filtering Semi-Structured Documents Based on Faceted Feedback

User Profile Learning

• The unified loss function to combine two types of feedback:

[Formula not preserved in this transcript. Its annotated terms cover the user-labeled documents, the necessary features, and the sufficient features.]

Ts, Tn: reference dist’ns
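Since the slide's loss formula is not preserved, the following is only one plausible shape for a loss with the three annotated terms: a negative log-likelihood over user-labeled documents plus KL penalties that pull the estimated sufficiency and necessity of user-selected features toward the reference distributions Ts and Tn. The weights, the reference values, and every function name are assumptions, and this should not be read as the authors' GCM objective.

```python
import math
from typing import List


def _clamp(p: float, eps: float = 1e-12) -> float:
    return min(max(p, eps), 1.0 - eps)


def bernoulli_kl(p: float, q: float) -> float:
    """KL(Bernoulli(p) || Bernoulli(q)) with clamped arguments."""
    p, q = _clamp(p), _clamp(q)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))


def unified_loss(label_probs: List[float], labels: List[int],
                 suff_estimates: List[float], nec_estimates: List[float],
                 t_s: float = 0.9, t_n: float = 0.9,
                 weight_s: float = 1.0, weight_n: float = 1.0) -> float:
    """Labeled-document log loss plus KL penalties for selected features."""
    # Term 1: user-labeled documents (negative log-likelihood).
    nll = -sum(math.log(_clamp(p) if y else _clamp(1.0 - p))
               for p, y in zip(label_probs, labels))
    # Term 2: features the user marked as sufficient.
    suff_penalty = sum(bernoulli_kl(t_s, s) for s in suff_estimates)
    # Term 3: features the user marked as necessary.
    nec_penalty = sum(bernoulli_kl(t_n, n) for n in nec_estimates)
    return nll + weight_s * suff_penalty + weight_n * nec_penalty


loss_close = unified_loss([0.9, 0.1], [1, 0], [0.9], [0.9])
loss_far = unified_loss([0.9, 0.1], [1, 0], [0.5], [0.9])
```

In a real learner the probabilities and estimates would all be functions of the profile vector, and the loss would be minimized with respect to that vector; here they are passed in as numbers to keep the sketch short.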

Page 21: Filtering Semi-Structured Documents Based on Faceted Feedback

User Interaction Mechanisms

• Two mechanisms
– Mechanism 1: ask users to select features they think are relevant
– Mechanism 2: ask users to specifically select features they think are sufficient and necessary respectively

Page 22: Filtering Semi-Structured Documents Based on Faceted Feedback

Outline

• Introduction
• Faceted Feedback
– Facet-Value Pair Candidate Selection
– Learning from Faceted Feedback
• Experiments
– Settings
– Results
• Summary

Page 23: Filtering Semi-Structured Documents Based on Faceted Feedback

Data Sets

• Use two data sets from TREC filtering track
– TREC 2000: OHSUMED (348,566 medical articles) + 63 topics (information needs)
• Metadata field: MeSH (Medical Subject Headings)
– TREC 2002: RCV1 (~800,000 news articles) + 50 topics defined by human assessors
• Metadata fields: Topic, Industry, Region
• Split each topic set into two equal-size subsets
– One for parameter tuning, the other for testing

Page 24: Filtering Semi-Structured Documents Based on Faceted Feedback

Faceted Feedback Collection

• Recruit subjects on Mechanical Turk
– Five subjects per topic
– The average performances will be reported
• For each topic, we show subjects
– The topic description (information need)
– A group of facet-value pair candidates

Page 25: Filtering Semi-Structured Documents Based on Faceted Feedback

Evaluation Metrics

• Precision (macro)
• Recall (macro)
• T11U = 2 * Nrd – Nnd
– Nrd: the number of relevant docs delivered
– Nnd: the number of non-relevant docs delivered
• T11SU = (max(T11U / MaxU, MinNU) – MinNU) / (1 – MinNU)
– MinNU = -0.5
– MaxU: the maximum possible utility (T11U)
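The utility metrics can be sketched as below. The scaled form follows the TREC 2002 filtering track definition, T11SU = (max(T11U / MaxU, MinNU) − MinNU) / (1 − MinNU), and assumes MaxU = 2 × (total number of relevant documents), i.e. the utility of delivering every relevant document and nothing else. Function and parameter names are hypothetical.

```python
def t11u(n_rel_delivered: int, n_nonrel_delivered: int) -> int:
    """Linear utility: +2 per relevant doc delivered, -1 per non-relevant doc."""
    return 2 * n_rel_delivered - n_nonrel_delivered


def t11su(n_rel_delivered: int, n_nonrel_delivered: int,
          n_rel_total: int, min_nu: float = -0.5) -> float:
    """Scaled utility in [0, 1]; assumes n_rel_total > 0."""
    max_u = 2 * n_rel_total  # best case: all relevant docs, no non-relevant ones
    normalized = t11u(n_rel_delivered, n_nonrel_delivered) / max_u
    return (max(normalized, min_nu) - min_nu) / (1.0 - min_nu)


example = t11su(n_rel_delivered=10, n_nonrel_delivered=4, n_rel_total=20)
```

Under this scaling a system that delivers nothing scores 1/3, so values below that indicate a profile that does worse than staying silent.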

Page 26: Filtering Semi-Structured Documents Based on Faceted Feedback

Outline

• Introduction
• Faceted Feedback
– Facet-Value Pair Candidate Selection
– Learning from Faceted Feedback
• Experiments
– Settings
– Results
• Summary

Page 27: Filtering Semi-Structured Documents Based on Faceted Feedback

Results 1: With/Without Faceted Feedback (FF)

[Figure: filtering performance with and without FF; x-axis: # relevant docs initially known]

Faceted feedback improves filtering performance, especially when fewer relevant documents are initially known.

Page 28: Filtering Semi-Structured Documents Based on Faceted Feedback

Results 2: Different Learning Algorithms

[Figure: our approach compared with existing approaches]

BOOL(A), BOOL(O): Boolean strategy

FS: feature selection based on FF

Pseudo-D/Q: pseudo relevant doc/query

Prior: logistic regression with Bayesian prior

GEC: generalized expectation criteria

Page 29: Filtering Semi-Structured Documents Based on Faceted Feedback

Outline

• Introduction
• Faceted Feedback
– Facet-Value Pair Candidate Selection
– Learning from Faceted Feedback
• Experiments
– Settings
– Results
• Summary

Page 30: Filtering Semi-Structured Documents Based on Faceted Feedback

Summary

• Faceted feedback is useful for filtering, especially in cold-start scenarios
• The Generalized Constraint Model (GCM) is a robust user profile learning algorithm
• In future work, we will evaluate our methods on data sets where faceted features are more important
– Movies, music, products, etc.

Page 31: Filtering Semi-Structured Documents Based on Faceted Feedback

Questions?
