Beyond Keyword Search: Discovering Relevant Scientific Literature Khalid El-Arini and Carlos Guestrin August 22, 2011


Page 1: Beyond Keyword Search:  Discovering Relevant Scientific Literature

Beyond Keyword Search: Discovering Relevant Scientific Literature

Khalid El-Arini and Carlos Guestrin

August 22, 2011

Page 2: Beyond Keyword Search:  Discovering Relevant Scientific Literature

“It will be almost as convenient to search for some bit of truth concealed in nature as it will be to find it hidden away in an immense multitude of bound volumes.”

- Denis Diderot, 1755

Today:

10^7 papers

10^5 publications [Thomson Reuters Web of Knowledge]

Page 3: Beyond Keyword Search:  Discovering Relevant Scientific Literature

Keyword search is dominant…

…but is it natural?

Page 4: Beyond Keyword Search:  Discovering Relevant Scientific Literature

Specific research question

Is there an approximation algorithm for the submodular covering problem that doesn’t require an integral-valued objective function?

Any recent papers influenced by this?

Page 5: Beyond Keyword Search:  Discovering Relevant Scientific Literature

Literature review

It’s 11:30pm Samoa Time. Your “Related Work” section is a bit sparse.

Here are some papers we’ve cited so far. Anything else?

Page 6: Beyond Keyword Search:  Discovering Relevant Scientific Literature

Given a set of relevant query papers, what else should I read?

Page 7: Beyond Keyword Search:  Discovering Relevant Scientific Literature

An example

query set

seminal/background paper?

a competing approach?

Cited by all query papers

Cites all query papers

However, unlikely to find papers directly connected to entire query set.

We need something more general…

Page 8: Beyond Keyword Search:  Discovering Relevant Scientific Literature

Select a set of papers A with maximum influence to/from the query set Q

Page 9: Beyond Keyword Search:  Discovering Relevant Scientific Literature

Modeling influence

Ideas flow from cited papers to citing papers

Page 10: Beyond Keyword Search:  Discovering Relevant Scientific Literature

Modeling influence

Ideas flow from prior knowledge of the authors

Page 11: Beyond Keyword Search:  Discovering Relevant Scientific Literature

Influence context

Why do I cite this paper?

- generative model of text
- variational inference
- EM
- …

We call these concepts.

Page 12: Beyond Keyword Search:  Discovering Relevant Scientific Literature

Concept representation

- Words, phrases or important technical terms
- Proteins, genes, or other advanced features

Our assumption:

Influence always occurs in the context of concepts

Page 13: Beyond Keyword Search:  Discovering Relevant Scientific Literature

Influence by concept

plant stress

(Grayed-out nodes don’t contain the given concept)

Which shows more influence?

Need to model the strength of each edge

Page 14: Beyond Keyword Search:  Discovering Relevant Scientific Literature

Influence strength

common authors

direct citation

oxygen

Page 15: Beyond Keyword Search:  Discovering Relevant Scientific Literature

Influence strength

(for normalization)

oxygen

Page 16: Beyond Keyword Search:  Discovering Relevant Scientific Literature

Influence strength

prevalence of “oxygen”

oxygen

Direct citations more indicative of influence than previous papers of the authors

Page 17: Beyond Keyword Search:  Discovering Relevant Scientific Literature

Influence strength

prevalence of “oxygen”

the weight between papers u and v w.r.t. concept c

oxygen

Page 18: Beyond Keyword Search:  Discovering Relevant Scientific Literature

Influence strength

plant

prob. of influence between x and y with respect to concept c

Influence exists if there is an active path between x and y (w.r.t. concept c)
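The definition above can be written out as follows. This is a sketch in notation chosen here rather than copied from the slides: $G_c$ is the graph restricted to papers containing concept $c$, $w^{(c)}_{u,v}$ is the edge weight between papers $u$ and $v$, and each edge is assumed to be active independently with probability equal to its weight (the assumption that matches the network reliability view of the problem).

```latex
\mathrm{Influence}_c(x \to y)
  \;=\; \Pr_{E' \sim G_c}\bigl[\, y \text{ is reachable from } x \text{ using only edges in } E' \,\bigr],
\qquad
\Pr\bigl[(u,v) \in E'\bigr] = w^{(c)}_{u,v} \ \text{independently for each edge } (u,v).
```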

Page 19: Beyond Keyword Search:  Discovering Relevant Scientific Literature

Computing influence

Definition is intuitive, but intractable to compute exactly
- #P-complete: the s-t network reliability problem

Approximations
- Sampling: sample complexity is provably logarithmic in size of corpus, but can still be slow in practice.
- Independence heuristic: fast, dynamic programming-based approach, but no explicit theoretical guarantees.
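The two approximations can be illustrated with a minimal Python sketch. It is not the authors' implementation: the adjacency format `edges`, the function names, the assumption that each edge is active independently with probability equal to its weight, and the requirement that the caller supplies a topological order are all choices made here for illustration. The heuristic shown is one plausible reading of "independence heuristic", in which the routes into a node are treated as independent events.

```python
import random
from collections import deque


def sample_reachable(edges, source, target, rng):
    """Draw one random subgraph lazily and test whether target is reachable.

    edges: dict mapping a node to a list of (neighbor, activation_probability).
    """
    seen = {source}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if u == target:
            return True
        for v, p in edges.get(u, []):
            # Flip the coin for edge (u, v) only when it could matter.
            if v not in seen and rng.random() < p:
                seen.add(v)
                queue.append(v)
    return False


def estimate_influence(edges, source, target, num_samples=10_000, seed=0):
    """Monte Carlo estimate: fraction of sampled subgraphs with an active path."""
    rng = random.Random(seed)
    hits = sum(sample_reachable(edges, source, target, rng) for _ in range(num_samples))
    return hits / num_samples


def independence_heuristic(edges, source, target, topo_order):
    """One dynamic-programming pass that treats incoming routes as independent.

    topo_order must list nodes so that every edge goes from earlier to later.
    """
    reach = {source: 1.0}
    for u in topo_order:
        pu = reach.get(u, 0.0)
        for v, p in edges.get(u, []):
            # Noisy-or update: v is missed only if every incoming route misses.
            reach[v] = 1.0 - (1.0 - reach.get(v, 0.0)) * (1.0 - pu * p)
    return reach.get(target, 0.0)


# Toy graph (made-up weights): both routes from x to y share the edge x -> m.
shared = {"x": [("m", 0.5)], "m": [("a", 1.0), ("b", 1.0)],
          "a": [("y", 1.0)], "b": [("y", 1.0)]}
sampled = estimate_influence(shared, "x", "y")
heuristic = independence_heuristic(shared, "x", "y", ["x", "m", "a", "b", "y"])
print(f"sampled={sampled:.3f} heuristic={heuristic:.3f}")
```

In this toy graph everything hinges on the single edge x -> m, so the true probability is 0.5 and the sampled estimate lands close to it. The heuristic returns 0.75 because it counts the two routes through m as if they were independent, which illustrates why the fast approach comes without explicit guarantees.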

Page 20: Beyond Keyword Search:  Discovering Relevant Scientific Literature

Recall:

Select a set of papers A with maximum influence to/from the query set Q, while maintaining:
- relevance
- diversity

Page 21: Beyond Keyword Search:  Discovering Relevant Scientific Literature

Influence + Relevance

Influence should focus on relevant concepts:

Prevalent in query documents Q

Should be a main theme of some document in A

Page 22: Beyond Keyword Search:  Discovering Relevant Scientific Literature

Influence + Diversity

Why diversity?
- Uncertainty about user’s information need
- Different approaches/facets to same research problem

Page 23: Beyond Keyword Search:  Discovering Relevant Scientific Literature

Influence + Diversity

We take a probabilistic max cover approach

[Figure: query papers]

Page 24: Beyond Keyword Search:  Discovering Relevant Scientific Literature

Influence + Diversity

[Figure adds: concepts (plant, oxygen, stress) for each query paper]

Page 25: Beyond Keyword Search:  Discovering Relevant Scientific Literature

Influence + Diversity

[Figure adds: candidate papers]

Page 26: Beyond Keyword Search:  Discovering Relevant Scientific Literature

Influence + Diversity

[Figure adds: influence]

Page 27: Beyond Keyword Search:  Discovering Relevant Scientific Literature

Set influence
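The set-influence formula itself is not preserved in this transcript. As a hedged illustration of what a probabilistic max cover objective can look like, the sketch below scores a selected set by how well it covers (query paper, concept) pairs. The function name, the `influence` lookup table, the per-concept weights, and the noisy-or combination are all assumptions made here, not the authors' definition.

```python
from math import prod


def set_influence(selected, query, concept_weight, influence):
    """Weighted probabilistic coverage of (query paper, concept) pairs.

    influence[(a, q, c)] is the assumed probability that paper a influences
    query paper q with respect to concept c (missing entries count as 0).
    """
    total = 0.0
    for q in query:
        for c, w in concept_weight.items():
            # Noisy-or: the pair (q, c) is covered if at least one selected
            # paper influences q with respect to c.
            miss = prod(1.0 - influence.get((a, q, c), 0.0) for a in selected)
            total += w * (1.0 - miss)
    return total


# Hypothetical numbers: a1 and a2 overlap on "plant", a3 covers "stress".
weights = {"plant": 1.0, "stress": 1.0}
infl = {("a1", "q1", "plant"): 0.5,
        ("a2", "q1", "plant"): 0.5,
        ("a3", "q1", "stress"): 0.5}

redundant = set_influence(["a1", "a2"], ["q1"], weights, infl)
diverse = set_influence(["a1", "a3"], ["q1"], weights, infl)
print(redundant, diverse)
```

With these made-up numbers the redundant pair scores 0.75 and the diverse pair scores 1.0, which is the behaviour a coverage-style objective is meant to encourage: a second paper on an already covered concept adds less than a paper covering a new one.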

Page 28: Beyond Keyword Search:  Discovering Relevant Scientific Literature

Putting it all together

Can now write objective function exactly describing what we want:

max

How do we solve this optimization?

Page 29: Beyond Keyword Search:  Discovering Relevant Scientific Literature

Optimization

Our objective is submodular
- an intuitive diminishing returns property

Using simple greedy algorithm, can maximize objective efficiently and near-optimally
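A simple greedy algorithm of this kind can be sketched as follows. The budget `k`, the function names, and the toy coverage objective are assumptions introduced here for illustration; the slides do not give the constraint or the code. For a monotone submodular objective under a cardinality budget, the standard result is that greedy selection reaches at least a (1 - 1/e) fraction of the optimal value.

```python
def greedy_select(candidates, objective, k):
    """Pick up to k items, each time adding the one with the largest marginal gain."""
    selected = []
    remaining = list(candidates)
    current = objective(selected)
    for _ in range(min(k, len(remaining))):
        # Marginal gain of each remaining candidate given what is already chosen.
        gains = [(objective(selected + [c]) - current, c) for c in remaining]
        best_gain, best = max(gains, key=lambda g: g[0])
        selected.append(best)
        remaining.remove(best)
        current += best_gain
    return selected


# Toy coverage objective (hypothetical data): number of distinct concepts covered.
concepts = {"p1": {"plant", "oxygen"}, "p2": {"plant"}, "p3": {"stress"}}


def covered(papers):
    return len(set().union(*(concepts[p] for p in papers)))


result = greedy_select(["p1", "p2", "p3"], covered, k=2)
print(result)
```

Here the second pick is p3 rather than p2, because p2 adds nothing once p1 already covers "plant". That is the diminishing returns property at work.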

Page 30: Beyond Keyword Search:  Discovering Relevant Scientific Literature

Recap

query set

max

result set

Page 31: Beyond Keyword Search:  Discovering Relevant Scientific Literature

But should all users get the same results?

Page 32: Beyond Keyword Search:  Discovering Relevant Scientific Literature

Personalized trust

Different communities trust different researchers for a given concept, e.g., “network”: Kleinberg, Hinton, Pearl

Goal: Estimate personalized trust from limited user input

Page 33: Beyond Keyword Search:  Discovering Relevant Scientific Literature

Specifying trust preferences

Specifying trust should not be an onerous task

Assume given (nonexhaustive!) set of trusted papers B, e.g.,
- a BibTeX file of all the researcher’s previous citations
- a short list of favorite conferences and journals
- someone else’s citation history!
  - a committee member?
  - journal editor?
  - someone in another field?
  - a Turing Award winner?

Page 34: Beyond Keyword Search:  Discovering Relevant Scientific Literature

Given trusted set B, how much do I trust author a with respect to concept c?

Page 35: Beyond Keyword Search:  Discovering Relevant Scientific Literature

Computing trust

How much do I trust Jon Kleinberg with respect to the concept “network”?

B

Kleinberg’s papers

0.2 0.4

An author is trusted if he/she influences the user’s trusted set B
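One way to turn this statement into a number is sketched below. The slide's actual formula is not preserved, so the noisy-or combination, the function name, and the `influence` lookup table are assumptions made here; the values 0.2 and 0.4 are reused from the figure purely as illustrative inputs.

```python
from math import prod


def author_trust(author_papers, trusted_set, concept, influence):
    """Assumed trust score: probability that at least one of the author's
    papers influences at least one trusted paper w.r.t. the concept.

    influence[(p, b, c)] is the probability that paper p influences trusted
    paper b with respect to concept c (missing entries count as 0).
    """
    miss = prod(
        1.0 - influence.get((p, b, concept), 0.0)
        for p in author_papers
        for b in trusted_set
    )
    return 1.0 - miss


# Hypothetical paper identifiers; only two influence values are nonzero.
infl = {("k1", "b1", "network"): 0.2, ("k2", "b2", "network"): 0.4}
trust = author_trust(["k1", "k2"], ["b1", "b2"], "network", infl)
print(round(trust, 2))
```

Under this reading the two influences combine to 1 - (0.8)(0.6) = 0.52. A different aggregation (for example a maximum or an average) would be equally consistent with the slide text.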

Page 36: Beyond Keyword Search:  Discovering Relevant Scientific Literature

Personalized Objective

Page 37: Beyond Keyword Search:  Discovering Relevant Scientific Literature

Personalized Objective

Does user trust at least one of authors of d with respect to concept c?
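The "at least one author" question has a natural probabilistic form if per-author trust values are treated as independent probabilities. That independence, the function name, and the `trust` table keyed by (author, concept) are assumptions of this sketch, not details given in the slides.

```python
from math import prod


def trusts_some_author(doc_authors, concept, trust):
    """Probability that the user trusts at least one author of a document
    with respect to a concept, assuming independent per-author trust."""
    return 1.0 - prod(1.0 - trust.get((a, concept), 0.0) for a in doc_authors)


# Hypothetical authors and trust values.
trust_table = {("author_a", "network"): 0.5, ("author_b", "network"): 0.5}
score = trusts_some_author(["author_a", "author_b"], "network", trust_table)
print(score)
```

A term like this could multiply a document's contribution for a concept in the objective, so that papers by untrusted authors count for less. How the authors combine it with influence is not shown in the transcript.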

Page 38: Beyond Keyword Search:  Discovering Relevant Scientific Literature

networks

graphics

data mining

Page 39: Beyond Keyword Search:  Discovering Relevant Scientific Literature

User Study Evaluation

16 PhD students in machine learning

For each participant:
- Select a recent paper for which we wish to find related work (the study paper)
- Compare our algorithm and three state-of-the-art alternatives:
  - Relational Topic Model
  - Information Genealogy
  - Google Scholar
- Show papers one at a time (double-blind), asking questions, e.g.,
  - Would this paper have been useful to you when writing the study paper?

Page 40: Beyond Keyword Search:  Discovering Relevant Scientific Literature

Usefulness

[Chart labels: our approach; higher is better]

Our approach provides more useful and more must-read papers

Page 41: Beyond Keyword Search:  Discovering Relevant Scientific Literature

Trust

[Chart labels: our approach; higher is better]

Our approach provides more trustworthy papers…

Page 42: Beyond Keyword Search:  Discovering Relevant Scientific Literature

Novelty

our approach

…but at the expense of some novelty.

Page 43: Beyond Keyword Search:  Discovering Relevant Scientific Literature

Diversity

Our approach produces more diverse results.

Page 44: Beyond Keyword Search:  Discovering Relevant Scientific Literature

Summary

Often difficult to phrase information needs as keyword queries

- Define query as small set of related papers
- Efficiently optimize submodular objective function based on intuitive notion of influence to select highly relevant articles
- Incorporate trust preferences to produce personalized results
- Participants in user study found our method to be more useful, trustworthy and diverse than other popular alternatives.

Live site coming soon!