Information Retrieval to Knowledge Retrieval, one more step
Xiaozhong Liu, Assistant Professor
School of Library and Information Science, Indiana University Bloomington


TRANSCRIPT

Page 1: Information Retrieval to Knowledge Retrieval, one more step

Information Retrieval to Knowledge Retrieval, one more step

Xiaozhong Liu, Assistant Professor

School of Library and Information Science, Indiana University Bloomington

Page 2

What is Information?

What is Retrieval?

What is Information Retrieval?

Page 3

I am a Retriever

Page 4

How do you find this book in the library?

Page 5

Search is driven by the user's information need!!

How to express your information need?

Query

Page 6

User Information Need!!

Query

What is a good query? What is a bad query?

Good query: query ≈ information need
Bad query: query ≠ information need

Wait!!! The user NEVER makes mistakes!!! It's OUR job!!!

Page 7

Task 1: Given the user's information need, how do we help the user (possibly automatically) propose a better query?

If there is a query… Perfect query: Query_optimize

User input query: Query_user

Page 8

User Information Need!!

Query → Results

Given a query, how do we retrieve results?

What are good results? What are bad results?

Page 9

Task 2: Given a query (not perfect), how do we retrieve documents from the collection?

Very large, unstructured text data!!!

F(query, doc)

Can you give me an example?

Page 10

F(query, doc)

If the query term exists in the doc → yes, this is a result

If the query term does NOT exist in the doc → no, this is not a result

Is there any problem with this function? Brainstorm…
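The matching function above can be sketched in a few lines (a toy illustration, not code from the talk; the sample documents echo the slides' examples):

```python
# Toy version of F(query, doc): a doc is a result iff every query term
# occurs in it verbatim (the exact-match function questioned on the slide).

def boolean_match(query, doc):
    """True iff every query term occurs as a token of the document."""
    doc_terms = set(doc.lower().split())
    return all(term in doc_terms for term in query.lower().split())

docs = [
    "My wife supports Obama's new policy on health care",
    "Michelle, as the first lady of the United States",
]
results = [d for d in docs if boolean_match("Obama's wife", d)]
# Only the first doc matches, even though the second is the relevant one.
```

Term matching like this ignores meaning entirely, which is exactly the problem the following slides explore.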

Page 11

Query: Obama’s wife

Doc 1. My wife supports Obama’s new policy on…

Doc 2. Michelle, as the first lady of the United States…

Yes, this is a very challenging task!

Page 12

Another problem: collection size: 5 billion; matching docs: 5

My algorithm successfully finds all 5 docs! In… 3 billion results…

Page 13

User Information Need!!

Query Results

How do we help the user find the results they need from all the retrieved results?

Page 14

Task 3: Given the retrieved results, how do we help users find what they need?

If the retrieval algorithm retrieved 1 billion results from the collection, what would you do???

Search with Google and click "next"???

Yes, we can help users find what they need!

Page 15

Query: Indiana University Bloomington

Can you read them one by one?

Would you use it??

Page 16

User Information Need!!

Query → Results

[Diagram: steps 1, 2, 3 connect the User side and the System side]

Page 17

User Information Need!!

Query → Results

[Diagram: steps 1, 2, 3 connect the User side and the System side]

They are not independent!

Page 18

Information Retrieval

Text

Image

Music

Map

……

Page 19

Information Retrieval

Text

Image

Music

Map

……

document

web

scholar

blog

news

Page 20

Index

Page 21

Documents vs. Database Records

• Relational database records are typically made up of well-defined fields:
  Select * from students where GPA > 2.5

• Text, similar way? Find all the docs including "Xiaozhong":
  Select * from documents where text like '%xiaozhong%'

We need a more effective way to index the text!

Page 22

Collection C: doc1, doc2, doc3 … docN

Query q: q1, q2, q3 … qt, where qx is a query term

Document doci: di1, di2, di3 … dim, where all dij ∈ V

Vocabulary V: w1, w2, w3 … wn

Page 23

Collection C: doc1, doc2, doc3 ……… docN

V: w1, w2, w3 ……… wn

Doc1 1 0 0 1

Doc2 0 0 0 1

Doc3 1 1 1 1

DocN 1 0 1 1

………

Query q: 0, 1, 0 ………

Page 24

Collection C: doc1, doc2, doc3 ……… docN

V: w1, w2, w3 ……… wn

Doc1 3 0 0 9

Doc2 0 0 0 7

Doc3 2 11 21 1

DocN 7 0 1 2

………

Query q: 0, 3, 0 ………

Normalization is very important!

Page 25

Collection C: doc1, doc2, doc3 ……… docN

V: w1, w2, w3 ……… wn

Doc1 0.41 0 0 0.62

Doc2 0 0 0 0.12

Doc3 0.42 0.11 0.34 0.13

DocN 0.01 0 0.19 0.24

………

Query q: 0, 0.37, 0 ………

Normalization is very important!

Weight

Page 26

Term weighting

TF * IDF

Term frequency: freq(w, doc) / |doc| (or variants…)

Inverse document frequency: 1 + log(N/k), where N = total number of docs in the collection and k = number of docs containing word w

An effective way to weight each word in a document
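A minimal sketch of this weighting in code (illustrative, not from the slides; it follows the slide's definitions tf = freq(w, doc)/|doc| and idf = 1 + log(N/k)):

```python
import math

def tf(word, doc_tokens):
    # term frequency, normalized by document length
    return doc_tokens.count(word) / len(doc_tokens)

def idf(word, collection):
    # 1 + log(N / k): N docs in the collection, k docs containing the word
    n = len(collection)
    k = sum(1 for doc in collection if word in doc)
    return 1 + math.log(n / k) if k else 0.0

def tfidf(word, doc_tokens, collection):
    return tf(word, doc_tokens) * idf(word, collection)

docs = [["cat", "dog", "cat"], ["cat", "bird"], ["fish"]]
w = tfidf("cat", docs[0], docs)   # (2/3) * (1 + log(3/2))
```

Rare words (high IDF) that occur often in a document (high TF) get the largest weights.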

Page 27

Index

Space?

Speed?

Retrieval Model?

Ranking?

Semantic?

The document representation must meet the requirements of the retrieval system

Page 28

Stemming

Education, Educational, Educate, Educating, Educations → Educat

Very effective for improving system performance.

Some risk! E.g., LA Lakers → LA Lake?
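As a toy illustration of the idea (a crude suffix-stripper, not the real Porter stemmer used in practice):

```python
def stem(word):
    """Strip the longest matching suffix, keeping a stem of >= 4 characters."""
    w = word.lower()
    for suffix in ["ional", "ions", "ion", "ing", "es", "e", "s"]:  # longest first
        if w.endswith(suffix) and len(w) - len(suffix) >= 4:
            return w[: -len(suffix)]
    return w

words = ["Education", "Educational", "Educate", "Educating", "Educations"]
stems = {stem(w) for w in words}   # every variant collapses to "educat"
```

The risk on the slide comes from exactly this kind of blind suffix stripping: the rule has no idea that a proper name should be left alone.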

Page 29

Doc 1: I love my cat.Doc 2: This cat is lovely!Doc 3: Yellow cat and white cat.

Inverted index

Tokens: i, love, my, cat, this, is, lovely, yellow, and, white

After stopword removal and stemming: i, love, cat, thi, yellow, white

i – 1
love – 1, 2
thi – 2
cat – 1, 2, 3
yellow – 3
white – 3

We lose something?

Page 30

Doc 1: I love my cat.Doc 2: This cat is lovely!Doc 3: Yellow cat and white cat.

Inverted index

Doc IDs only:
i – 1; love – 1, 2; thi – 2; cat – 1, 2, 3; yellow – 3; white – 3

With term frequencies (doc:freq):
i – 1:1; love – 1:1, 2:1; thi – 2:1; cat – 1:1, 2:1, 3:2; yellow – 3:1; white – 3:1

We still lose something?

Page 31

Doc 1: I love my cat.Doc 2: This cat is lovely!Doc 3: Yellow cat and white cat.

Inverted index

With term frequencies (doc:freq):
i – 1:1; love – 1:1, 2:1; thi – 2:1; cat – 1:1, 2:1, 3:2; yellow – 3:1; white – 3:1

With positions (doc:position):
i – 1:1; love – 1:2, 2:4; thi – 2:1; cat – 1:4, 2:2, 3:2, 3:5; yellow – 3:1; white – 3:4

Why do you need position info?
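A sketch of how such positional postings are built (illustrative code; tokenization here is naive and, unlike the slides, nothing is stemmed or stopped):

```python
from collections import defaultdict

def build_index(docs):
    """Map each term to a postings list of (doc_id, position) pairs."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs, start=1):
        tokens = text.lower().replace(".", "").replace("!", "").split()
        for pos, term in enumerate(tokens, start=1):
            index[term].append((doc_id, pos))
    return index

docs = ["I love my cat.", "This cat is lovely!", "Yellow cat and white cat."]
index = build_index(docs)
# index["cat"] -> [(1, 4), (2, 2), (3, 2), (3, 5)]
```

With positions stored, the proximity question on the next slides — do two query terms appear next to each other? — becomes a cheap comparison of postings lists.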

Page 32

Doc 1: information retrieval is important for digital library.

Doc 2: I need some information about the dogs, my favorite is golden retriever.

Proximity of query terms. Query: information retrieval

Page 33

Doc 1: information retrieval is important for digital library.

Doc 2: I need some information about the dogs, my favorite is golden retriever.

Index – bag of words. Query: information retrieval

What's the limitation of bag-of-words? Can we make it better?

n-gram:

Doc 1 bi-grams: information retrieval, retrieval is, is important, important for, ……

Better semantic representation!What’s the limitation?
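Generating those bigrams is a one-line sliding window (an illustrative sketch):

```python
def ngrams(tokens, n):
    """All contiguous n-token windows, joined into strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

doc1 = "information retrieval is important for digital library".split()
bigrams = ngrams(doc1, 2)
# ['information retrieval', 'retrieval is', 'is important', ...]
```

One limitation is plain to see here: a document with m tokens yields m − n + 1 n-grams per order indexed, so the index grows quickly, and most n-grams ("is important", "important for") are not meaningful phrases.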

Page 34

Doc 1: …… big apple ……

Doc 2: …… apple……

Index – bag of “phrase”?

More precision, less ambiguous

How to identify phrases from documents?

• Identify syntactic phrases using POS tagging
• n-grams
• From existing resources

Page 35

Noise detection

What is the noise on a web page? Non-informative content…

Page 36

Web Crawler - freshness

The Web is changing, but we cannot constantly check all the pages…

We need to find the most important pages and how frequently they change:

www.nba.com

www.iub.edu

www.restaurant????.com

Sitemap: a list of URLs for each host, with modification times and change frequencies

Page 37

Retrieval

Page 38

Model

Mathematical modeling is frequently used with the objective to understand, explain, reason and predict behavior or phenomenon in the real world (Hiemstra, 2001).

e.g., some models help you predict tomorrow's stock prices…

Page 39

Hypothesis:

Retrieval and ranking problem = Similarity Problem!

Vector Space Model

Is that a good hypothesis? Why?

Retrieval function: Similarity(query, document)

It returns a score!!! So we can rank the documents!!!

Page 40

So, a query is just a short document

Vector Space Model

Page 41

Collection C: doc1, doc2, doc3 ……… docN

V: w1, w2, w3 ……… wn

Doc1 0.41 0 0 0.62

Doc2 0 0 0 0.12

Doc3 0.42 0.11 0.34 0.13

DocN 0.01 0 0.19 0.24

………

Query q: 0, 0.37, 0 ………

Page 42

Collection C: doc1, doc2, doc3 ……… docN

V: w1, w2, w3 ……… wn

Doc1 0.41 0 0 0.62

Doc2 0 0 0 0.12

Doc3 0.42 0.11 0.34 0.13

DocN 0.01 0 0.19 0.24

………

Query q: 0, 0.37, 0 ………

Similarity

Doc Vector

Query Vector

Page 43

Doc1: ……Cat……dog……cat……Doc2: ……Cat……dogDoc3: ……snake……

Query: dog cat cat

[Figure: documents as vectors in the (cat, dog) term space — doc 1 = (2, 1), doc 2 = (1, 1), doc 3 = (0, 0)]

Page 44

Doc1: ……Cat……dog……cat……Doc2: ……Cat……dogDoc3: ……snake……

Query: dog cat

F (q, doc) = cosine similarity (q, doc)

[Figure: doc 1, doc 2, and doc 3 as vectors in the (cat, dog) plane, with angle θ between query and document; doc 2 points in the same direction as the query]

Why Cosine?

Page 45

Vector Space Model

Dimension = n = vocabulary size

Query q: q1, q2, q3 … qn — same dimensional space!!!

Document doci: di1, di2, di3 … din, where all dij ∈ V

Vocabulary V: w1, w2, w3 … wn

Page 46

Doc1: ……Cat……dog……cat……Doc2: ……Cat……dogDoc3: ……snake……

Query: dog cat

Try!
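One way to work this exercise (a sketch; the vectors use raw counts of the two query terms, so over (cat, dog): doc 1 = (2, 1), doc 2 = (1, 1), doc 3 = (0, 0)):

```python
import math

def cosine(u, v):
    """Cosine of the angle between two term vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

query = (1, 1)                                    # "dog cat" -> (cat 1, dog 1)
docs = {"doc1": (2, 1), "doc2": (1, 1), "doc3": (0, 0)}
ranked = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
# ranked -> ['doc2', 'doc1', 'doc3']: doc2 points exactly along the query
```

Cosine rewards direction rather than length, which is why doc 2 beats the longer doc 1 here.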

Page 47

Term weighting

Doc [ 0.42 0.11 0.34 0.13 ]

weight, how?

TF * IDF

Term frequency: freq(w, doc) / |doc| (or variants…)

Inverse document frequency: 1 + log(N/k), where N = total number of docs in the collection and k = number of docs containing word w

Page 48

More TF

Weighting is very important for the retrieval model! We can improve TF by…

e.g., freq(term, doc) → log[freq(term, doc)]

BM25:
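The BM25 formula image did not survive the transcript; for reference, the standard Okapi BM25 term-frequency component looks like this (a sketch; k1 and b are the usual tuning parameters, not values from the talk):

```python
def bm25_tf(freq, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Saturating, length-normalized term frequency used by Okapi BM25."""
    length_norm = 1 - b + b * (doc_len / avg_doc_len)
    return freq * (k1 + 1) / (freq + k1 * length_norm)

# Unlike raw or log TF, the score saturates: extra occurrences of a term
# add less and less, and the contribution is capped below k1 + 1.
low, high = bm25_tf(1, 100, 100), bm25_tf(10, 100, 100)
```

The length normalization in the denominator also penalizes documents that are long relative to the collection average.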

Page 49

Vector Space Model

But…

Bag-of-words assumption = words are independent!

Query = short document? Maybe not true!

Vectors and SEO (Search Engine Optimization)…

Synonyms? Semantically related words?

Page 50

How about these…

Pivoted Normalization Method

Dirichlet Prior Method

TF, IDF, normalization

+ parameters

Page 51

Language model

Probability distribution over words

P (I love you) = 0.01P (you love I) = 0.00001P (love you I) = 0.0000001

If we have this information… we could build a generative model!

P(text | θ)

Page 52

Language model - unigram

Generate text under the bag-of-words assumption (words are independent):

P (w1, w2,…wn) = P(w1) P(w2)…P(wn)
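Under this assumption the generative model is just a product of per-word probabilities (a toy sketch; the distribution below is invented for illustration):

```python
# Toy unigram distribution for a hypothetical "topic X"
topic_x = {"food": 0.3, "orange": 0.2, "milk": 0.3, "desk": 0.1, "usb": 0.1}

def p_text(words, model):
    """P(w1, ..., wn) = P(w1) * ... * P(wn) under the unigram model."""
    p = 1.0
    for w in words:
        p *= model.get(w, 0.0)   # unseen words get probability 0 (hence smoothing, later)
    return p

p1 = p_text(["food", "milk"], topic_x)   # 0.3 * 0.3 = 0.09
```

Note that word order is ignored entirely: "food milk" and "milk food" get the same probability.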

[Word cloud: food, orange, desk, USB, computer, Apple, Unix, milk, sport, superbowl, …]

topic X = ???

Page 53

[Word clouds for topic 1 and topic 2: food, orange, desk, milk, yogurt vs. USB, computer, Apple, Unix, iPad, NBA, sport, superbowl, NHL, score, information, …]

topic 1 / topic 2

Doc: I’m using Mac computer… remote access another computer… share some USB device…

P(Doc | topic1) vs. P(Doc | topic2)

Page 54

[Word clouds: king, ghost, hamlet, play, romeo, juliet vs. iPad, iPhone 4S, TV, apple, play, store]

Page 55

[Word cloud for topic X: food, orange, desk, USB, computer, Apple, Unix, milk, sport, superbowl, …]

topic X

How to estimate???

If we have enough data, i.e. docs about topic X

10/10000 1000/10000 30/10000

P(“computer” | topic X)

Page 56

[Word clouds for doc 1 and doc 2: food, orange, desk, milk, yogurt vs. USB, computer, Apple, Unix, iPad, NBA, sport, superbowl, NHL, score, information, …]

doc 1 / doc 2

query: sport game watch

P(query | doc 1) vs. P(query | doc 2)

Page 57

For a document doc:

Retrieval problem → query likelihood → query term likelihood P(qi | doc)

But a document is only a small sample of its topic… the data is sparse:

Smoothing!

Page 58

P(qi | doc): what if qi is not observed in doc? Is P(qi | doc) = 0?

We want to give it a non-zero score!!!

Smoothing

e.g. [smoothing formula image not preserved in the transcript]

We can make it better!

Page 59

Smoothing

First, it addresses the data sparseness problem. As a document is only a very small sample, the probability P (qi | Doc) could be zero for those unseen words (Zhai & Lafferty, 2004).

Second, smoothing helps to model the background (non-discriminative) words in the query.

Improve language model estimation by using Smoothing

Page 60

Smoothing

Another smoothing method:

P(w | doc):

• if the word exists in doc → use the document model P(w | doc)

• if the word does not exist in doc → back off to P(w | collection), the collection language model

Linear interpolation: P(w | doc) = (1 − λ) · P(w | θdoc) + λ · P(w | θcollection)
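A sketch of this interpolation in code (illustrative; λ = 0.5 and the tiny "collection" are assumptions, not values from the talk):

```python
import math

def p_smoothed(word, doc_tokens, collection_tokens, lam=0.5):
    """Jelinek-Mercer smoothing: mix the document and collection models."""
    p_doc = doc_tokens.count(word) / len(doc_tokens)
    p_col = collection_tokens.count(word) / len(collection_tokens)
    return (1 - lam) * p_doc + lam * p_col

def query_log_likelihood(query, doc_tokens, collection_tokens, lam=0.5):
    return sum(math.log(p_smoothed(q, doc_tokens, collection_tokens, lam))
               for q in query)

doc = "sport superbowl score nba".split()
collection = "sport superbowl score nba food milk usb computer game watch".split()
score = query_log_likelihood(["sport", "game"], doc, collection)
# "game" never occurs in the doc, yet the score is finite, not log(0)
```

Working in log space avoids underflow when queries get long; the collection term is what keeps every log argument positive.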

Page 61

Smoothing

We could use collection language model:

TF-IDF is closely related to the language model and other retrieval models: the smoothed score involves term frequency, an IDF-like factor, and document length normalization.

Page 62

Language model

Solid statistical foundation

Flexible parameter settings

Different smoothing methods

Page 63

Language model in library?

If we have a paper… and a query…

Similarity(paper, query) — Vector Space Model

If query word not in the paper…

Score = 0

If we use language model…

Page 64

Language model in library?

Likelihood of query given a paper can be estimated by:

P(query | θ) = α·P(query | paper) + β·P(query | author) + γ·P(query | journal) + ……

Likelihood of query given a paper & author & journal & ……
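Numerically, the mixture behaves like this (a toy sketch; the weights and probabilities are invented, and α + β + γ should sum to 1):

```python
def mixture_likelihood(p_paper, p_author, p_journal,
                       alpha=0.6, beta=0.3, gamma=0.1):
    """Interpolate paper, author, and journal language models."""
    return alpha * p_paper + beta * p_author + gamma * p_journal

# Even when the query word never occurs in the paper text (p_paper = 0),
# the author and journal models keep the score non-zero.
score = mixture_likelihood(0.0, 0.02, 0.05)   # 0.3*0.02 + 0.1*0.05 = 0.011
```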

Page 65

e.g. what’s the difference between web and doc retrieval???

F (doc, query)

F (web page, query)

vs

web page = doc + hyperlinks + domain info + anchor text + metadata + …

Can you use those to improve system performance???

Page 66

Knowledge

Page 67
Page 68

Score each topic, level of interest

Topic 1

Topic 2

Page 69

CI-n … CI-2, CI-1, CI-now

[Formula garbled in transcript: TopicScore(n) compares today's interest P(Day_today | Z_n) with the mean and standard deviation of the historical P(Day_in | Z_n) values, with coefficients a and b, to label each topic]

Hot topic / Diminishing topic / Regular topic

Current Interest vs. Historical Interest

Page 70

“Obama”, Nov 5th 2008 After Election

[Plot: daily interest in "Obama" over a 30-day window around the election]

Nov 5th CIV:

Wiki:Barack_Obama; Wiki:Election; win; success; Wiki:President_of_the_United_States

Wiki:African_American; President; World; America; victory; record; first; president; 44th; History; Wiki:Victory_Records

Entity:first_black_president; Celebrate; black; african

Wiki:Colin_Powell; Wiki:Secretary_of_State; Wiki:United_States

Wiki:Sarah_Palin; sarah; palin; hillary; Secret; Wiki:Hillary_Rodham_Clinton

Clinton; newsweek; club; cloth

1. Win
2. Create history
3. First black president

Page 71

Google web | NDCG3 | NDCG5 | NDCG10 | t-test
CIV | 0.35909366 | 0.399970894 | 0.479302401 |
CILM | 0.356652652 | 0.387120299 | 0.483420045 |
Google | 0.230423817 | 0.318737414 | 0.388792379 | **
TFIDF | 0.27596245 | 0.333012091 | 0.437831859 | *
BM25 | 0.284599431 | 0.336961764 | 0.436466778 | *
LM (linear) | 0.32558799 | 0.382113457 | 0.473992963 |
LM (dirichlet) | 0.34665084 | 0.358128576 | 0.45150825 |
LM (twostage) | 0.349735965 | 0.358725227 | 0.450046444 |
BEST1 | CIV | CIV | CILM |
BEST2 | CILM | CILM | CIV |
Significance test: *** t < 0.05, ** t < 0.10, * t < 0.15

Yahoo_web | NDCG3 | NDCG5 | NDCG10 | t-test
CIV | 0.351765133 | 0.38207777 | 0.475506721 |
CILM | 0.391807685 | 0.40623334 | 0.482464858 |
Yahoo | 0.288059321 | 0.326373542 | 0.410969176 |
TFIDF | 0.24320988 | 0.282799657 | 0.404092457 | ***
BM25 | 0.245263974 | 0.277579262 | 0.395953269 | ***
LM (linear) | 0.276208943 | 0.316889107 | 0.432428784 | *
LM (dirichlet) | 0.223253393 | 0.270017519 | 0.385936078 | ***
LM (twostage) | 0.219225991 | 0.266537146 | 0.384349848 | ***
BEST1 | CILM | CILM | CILM |
BEST2 | CIV | CIV | CIV |
Significance test: *** t < 0.05, ** t < 0.10, * t < 0.15

Page 72

Knowledge Retrieval System

Knowledge-based Information Need

Knowledge within Scientific Literature

Matching

Query Knowledge Representation

How to help users propose knowledge-based queries?

How to represent knowledge?

How to match between the two?

Page 73

Academic Knowledge

Page 74


Query Recommendation & Feedback

Query Recommendation

Query Feedback

Page 75
Page 76


Structural Keyword Generation – Features

Category | Feature | Description or Example
Keyword Content | Content_Of_Keyword | a vector of all the tokens in the keyword (stemmed, case-insensitive, stop words removed)
Keyword Content | CAP | whether the keyword is capitalized
Keyword Content | Contain_Digit | whether the keyword contains digits, e.g., TREC2002 → true
Keyword Content | Character_Length_Of_Keyword | number of characters in the target keyword
Keyword Content | Token_Length_Of_Keyword | number of tokens in the keyword
Keyword Content | Category_Length_Of_Keyword | number of tokens in the keyword; lengths above four are capped at four
Title Context | Exist_In_Title | whether the keyword exists in the title (stemmed, case-insensitive, stop words removed)
Title Context | Location_In_Title | the position where the keyword appears in the title
Title Context | Title_Text_POS | unigram and its part of speech in the title (in a text window)
Title Context | Title_Unigram | unigram of the keyword in the title (in a text window)
Title Context | Title_Bigram | bigram of the keyword in the title (in a text window)
Abstract Context | Location_In_Abstract | which sentence of the abstract the keyword appears in
Abstract Context | Keyword_Position_In_Sentence_Of_Abstract | the keyword's position in the sentence (beginning, middle, or end)
Abstract Context | Abstract_Freq | how many times the keyword appears in the abstract
Abstract Context | Abstract_Text_POS | unigram and its part of speech in the abstract (in a text window)
Abstract Context | Abstract_Unigram | unigram of the keyword in the abstract (in a text window)
Abstract Context | Abstract_Bigram | bigram of the keyword in the abstract (in a text window)

Page 77

Evaluation – Domain Knowledge Generation

F1 comparison | Concept | Supervised | Semi-supervised
Keyword-based features | Research Question | 0.637 | 0.662
Keyword-based features | Methodology | 0.479 | 0.516
Keyword-based features | Dataset | 0.824 | 0.816
Keyword-based features | Evaluation | 0.571 | 0.571
Keyword + Title-based features | Research Question | 0.633 | 0.667
Keyword + Title-based features | Methodology | 0.498 | 0.534
Keyword + Title-based features | Dataset | 0.824 | 0.816
Keyword + Title-based features | Evaluation | 0.571 | 0.571
Keyword + Title + Abstract-based features | Research Question | 0.642 | 0.663
Keyword + Title + Abstract-based features | Methodology | 0.420 | 0.542
Keyword + Title + Abstract-based features | Dataset | 0.831 | 0.823
Keyword + Title + Abstract-based features | Evaluation | 0.621 | 0.662

F-measure comparison for supervised vs. semi-supervised learning

GOOD! but not PERFECT…

Page 78

Knowledge comes from…

• System? Machine learning, but… modest performance…

• User? No way! Very high cost! Authors won't contribute…

• System + User? Possible!

Page 79

WikiBackyard

ScholarWiki

Edit Trigger: 1. the wiki page improves; 2. the machine learning model improves; 3. all other wiki pages improve; 4. the KR index improves!

Page 80

User + machine learning is powerful… YES! It helps!!!

Page 81

• Knowledge retrieval for scholarly publications…
• Knowledge from the paper
• Knowledge from the user
 – Knowledge feedback
 – Knowledge recommendation

• Knowledge from User vs. from Machine learning

• ScholarWiki (user) + WikiBackyard (machine)

Page 82

Knowledge via Social Network and Text Mining

Page 83

CITATION? CO-OCCUR? CO-AUTHOR?

Page 84

Content of each node?Motivation of each citation?

With further study of citation analysis, increasing numbers of researchers have come to doubt the reasonableness of assuming that the raw number of citations reflects an article’s influence (MacRoberts & MacRoberts 1996). On the other hand, full-text analysis has to some extent compensated for the weaknesses of citation counts and has offered new opportunities for citation analysis.

Full text citation analysis

Page 85

With further study of citation analysis, increasing numbers of researchers have come to doubt the reasonableness of assuming that the raw number of citations reflects an article’s influence (MacRoberts & MacRoberts 1996). On the other hand, full-text analysis has to some extent compensated for the weaknesses of citation counts and has offered new opportunities for citation analysis.

Every word in the citation context will VOTE!! Motivation? Topic? Reason??? The left and right N words?? N = ?

Page 86

With further study of citation analysis, increasing numbers of researchers have come to doubt the reasonableness of assuming that the raw number of citations reflects an article’s influence (MacRoberts & MacRoberts 1996). On the other hand, full-text analysis has to some extent compensated for the weaknesses of citation counts and has offered new opportunities for citation analysis.

A word's effectiveness decays with its distance from the citation!!!

Closer words make a more significant contribution!!
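One simple way to encode this decay (a sketch; the exponential kernel and the decay rate are assumed choices, not the talk's actual weighting):

```python
import math

def vote_weights(words, cite_pos, decay=0.5):
    """Weight each context word by exp(-decay * distance) from the citation."""
    return {w: math.exp(-decay * abs(i - cite_pos))
            for i, w in enumerate(words)}

context = ["raw", "number", "of", "citations", "[CITE]", "influence"]
weights = vote_weights(context, cite_pos=4)
# "citations" (distance 1) votes with far more weight than "raw" (distance 4)
```

Any monotonically decreasing kernel would serve; the point is that a word's vote for the cited paper shrinks as the word moves away from the citation marker.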

Page 87

How about a language model? Each node and each edge represented by a language model? High-dimensional space! Word differences?

Page 88

Topic modeling – each node is represented by a topic distribution (Prior Distribution); each edge is represented by a topic distribution (Transitioning Probability Distribution)

Page 89

Supervised topic modeling

1. Each topic has a label (YES! We can interpret each topic)
2. We DO KNOW the total number of topics

Each paper is a mixture (probability distribution) over author-given keywords

Page 90

Each paper: p_zkeyi(paper) = P(z_keyi | abstract, title)

With further study of citation analysis, increasing numbers of researchers have come to doubt the reasonableness of assuming that the raw number of citations reflects an article’s influence (MacRoberts & MacRoberts 1996). On the other hand, full-text analysis has to some extent compensated for the weaknesses of citation counts and has offered new opportunities for citation analysis.

Page 91

Paper importance

if we have 3 topics (keywords): key1, key2, key3

Domain credit: 100

pub 1: 25; pub 2: 25; pub 3: 25; pub 4: 25

P(key1 | text) = 0.6
P(key2 | text) = 0.15
P(key3 | text) = 0.25

Key1-Pub1 credit: 25 * 0.6

P(key1 | citation) = 0.8
P(key2 | citation) = 0.1
P(key3 | citation) = 0.1

Key1-Citation1 credit: 25 * 0.6*[0.8/(0.8+0.2)]

(citation edge weights: 0.8 and 0.2)

Evenly share the credits?

A citation is important if:
1. the citation focuses on an important topic
2. the other citations focus on other topics
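The arithmetic on this slide can be checked directly (the even 25-credit split and the 0.6 and 0.8 probabilities are the slide's own numbers):

```python
domain_credit = 100
pub_credit = domain_credit / 4                   # 4 pubs share the credit evenly

p_key1_text = 0.6                                # P(key1 | text) for pub 1
key1_pub1 = pub_credit * p_key1_text             # 25 * 0.6 = 15 credits on key1

# Two incoming citations carry P(key1 | citation) of 0.8 and 0.2;
# citation 1 takes its normalized share of pub 1's key1 credit.
key1_citation1 = key1_pub1 * 0.8 / (0.8 + 0.2)   # 15 * 0.8 = 12 credits
```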

Page 92

Paper importance

if we have 3 keywords: key1, key2, key3

Domain credit: 100

pub 1: 25; pub 2: 25; pub 3: 25; pub 4: 25

Key1-Pub1 credit: 25 * 0.6

Key1-Citation1 credit: 25 * 0.6*[0.8/(0.8+0.2)]

(citation edge weights: 0.8 and 0.2)

[Credit vectors per pub: [25,25,25] → [29,26,28], [27,27,26], [25,25,25]]

Domain publication ranking
Domain keyword topical ranking
Topical citation tree

The number of citations between a paper pair is IMPORTANT!

Page 93

Different citations contribute differently to the different topics (keywords) of the citing publication.

Page 94

Publication/venue/author topic prior

Citation transitioning topic prior

Page 95

[Plot: NDCG@10 through NDCG@ALL for review citation recommendation]

Page 96

Literature Review Citation recommendation

Input: Paper Abstract

Output: A list of ranked citations

MAP and NDCG evaluation

Page 97

Given a paper abstract:

1. Word-level match (language model)
2. Topic-level match (KL-divergence)
3. Topic importance

Use Inference Network to integrate each hypothesis

Page 98

Citation Recommendation

Content Match / Publication Topical Prior:

1. PageRank
2. Full-text PageRank (greedy match)
3. Full-text PageRank (topic modeling)

Topic match → Inference Network

Page 99

Input

Output:

1. [3] YES 3
2. [2] YES 2
3. [6] NO 0
4. [8] NO 0
5. [10] YES 1
6. [1] NO 0
……

MAP (cite or not?)

NDCG (important citation?)

Page 100

[Plot: NDCG@10 through NDCG@ALL for citation recommendation based on the abstract]

Based on greedy match, 1 second

Based on topic inference, 30 seconds

Page 101

CONCLUSION

• Information Retrieval
• Index
• Retrieval Model
• Ranking
• User feedback
• Evaluation

• Knowledge Retrieval
• Machine Learning
• User Knowledge
• Integration
• Social Network Analysis

Page 102
Page 103
Page 104

Thank you!