TRANSCRIPT
Statistical Language Models for
Biomedical Literature Retrieval
ChengXiang Zhai
Department of Computer Science,
Institute for Genomic Biology,
and Graduate School of Library & Information Science
University of Illinois, Urbana-Champaign
Motivation
• Biomedical literature serves as a “complete” documentation of the biomedical knowledge discovered by scientists
• Medline: > 10,000,000 literature abstracts (1966-)
• Effective access to biomedical literature is essential for
– Understanding related existing discoveries
– Formulating new hypotheses
– Verifying hypotheses
– …
• Biologists routinely use PubMed to access literature (http://www.ncbi.nlm.nih.gov/PubMed)
Challenges in Biomedical Literature Retrieval
• Tokenization
– Many names are irregular with special characters such as “/”, “-”, etc. E.g., MIP-1-alpha, (MIP)-1alpha
– Ambiguous words: “was” and “as” can be genes
• Semi-structured queries
– It is often desirable to expand a query about a gene with synonyms of the gene; the expanded query would have several fields (original name + symbols)
– “Find the role of gene A in disease B” (3 fields)
• …
TREC Genomics Track
• TREC (Text REtrieval Conference):
– Started 1992; sponsored by NIST
– Large-scale evaluation of information retrieval (IR) techniques
• Genomics Track
– Started in 2003
– Still continuing
– Evaluation of IR for biomedical literature search
Typical TREC Cycle
• Feb: Application for participation
• Spring: Preliminary (training) data available
• Beginning of Summer: Official test data available
• End of Summer: Result submission
• Early Fall: Official evaluation; results are out in Oct
• Nov: TREC Workshop; plan for next year
UIUC Participation
• 2003: Obtained initial experience; recognized the problem of “semi-structured queries”
• 2005: Continued developing semi-structured language models
• 2006: Applied hidden Markov models to passage retrieval
Outline
• Standard IR Techniques
• Semi-structured Query Language Models
• Parameter Estimation
• Experiment Results
• Conclusions and Future Work
What is Text Retrieval (TR)?
• There exists a collection of text documents
• User gives a query to express the information need
• A retrieval system returns relevant documents to users
• More commonly known as “Information Retrieval” (IR)
• Known as “search technology” in industry
TR is Hard!
• Under/over-specified query
– Ambiguous: “buying CDs” (money or music?)
– Incomplete: what kind of CDs?
– What if “CD” is never mentioned in document?
• Vague semantics of documents
– Ambiguity: e.g., word-sense, structural
– Incomplete: Inferences required
• Even hard for people!
– 80% agreement in human judgments
TR is “Easy”!
• TR CAN be easy in a particular case
– Ambiguity in query/document is RELATIVE to the database
– So, if the query is SPECIFIC enough, just one keyword may get all the relevant documents
• PERCEIVED TR performance is usually better than the actual performance
– Users can NOT judge the completeness of an answer
Formal Formulation of TR
• Vocabulary V={w1, w2, …, wN} of language
• Query q = q1,…,qm, where qi ∈ V
• Document di = di1,…,dimi, where dij ∈ V
• Collection C = {d1, …, dk}
• Set of relevant documents R(q) ⊆ C
– Generally unknown and user-dependent
– Query is a “hint” on which doc is in R(q)
• Task = compute R’(q), an “approximate R(q)”
Computing R(q)
• Strategy 1: Document selection
– R(q) = {d ∈ C | f(d,q) = 1}, where f(d,q) ∈ {0,1} is an indicator function or classifier
– System must decide if a doc is relevant or not (“absolute relevance”)
• Strategy 2: Document ranking
– R(q) = {d ∈ C | f(d,q) > θ}, where f(d,q) is a relevance measure function; θ is a cutoff
– System must decide if one doc is more likely to be relevant than another (“relative relevance”)
Document Selection vs. Ranking
[Figure: document selection vs. document ranking. Selection uses a binary classifier f(d,q) ∈ {0,1} that splits the collection into an accepted set R’(q) and a rejected set, shown against the true R(q). Ranking uses a scoring function f(d,q), e.g. d1 0.98 (+), d2 0.95 (+), d3 0.83 (-), d4 0.80 (+), d5 0.76 (-), d6 0.56 (-), d7 0.34 (-), d8 0.21 (+), d9 0.21 (-); the user decides how far down the ranked list to go, which implicitly defines R’(q).]
Problems of Doc Selection
• The classifier is unlikely accurate
– “Over-constrained” query (terms are too specific): no relevant documents found
– “Under-constrained” query (terms are too general): over delivery
– It is extremely hard to find the right position between these two extremes
• Even if it is accurate, all relevant documents are not equally relevant
• Relevance is a matter of degree!
Ranking is often preferred
• Relevance is a matter of degree
• A user can stop browsing anywhere, so the boundary is controlled by the user
– High recall users would view more items
– High precision users would view only a few
• Theoretical justification: Probability Ranking Principle [Robertson 77]
Evaluation Criteria
• Effectiveness/Accuracy
– Precision, Recall
• Efficiency
– Space and time complexity
• Usability
– How useful for real user tasks?
Methodology: Cranfield Tradition
• Laboratory testing of system components
– Precision, Recall
– Comparative testing
• Test collections
– Set of documents
– Set of questions
– Relevance judgments
The Contingency Table
Doc \ Action      Retrieved              Not Retrieved
Relevant          Relevant Retrieved     Relevant Rejected
Not relevant      Irrelevant Retrieved   Irrelevant Rejected

Recall = |Relevant ∩ Retrieved| / |Relevant|
Precision = |Relevant ∩ Retrieved| / |Retrieved|
How to measure a ranking?
• Compute the precision at every recall point
• Plot a precision-recall (PR) curve
[Figure: two precision-recall curves (precision vs. recall), each drawn through four measured points, with the question: which is better?]
Summarize a Ranking
• Given that n docs are retrieved
– Compute the precision (at rank) where each (new) relevant document is retrieved => p(1),…,p(k), if we have k rel. docs
– E.g., if the first rel. doc is at the 2nd rank, then p(1)=1/2.
– If a relevant document never gets retrieved, we assume the precision corresponding to that rel. doc to be zero
• Compute the average over all the relevant documents
– Average precision = (p(1)+…+p(k))/k
• This gives us (non-interpolated) average precision, which captures both precision and recall and is sensitive to the rank of each relevant document
• Mean Average Precision (MAP)
– MAP = arithmetic mean of average precision over a set of topics
– gMAP = geometric mean average precision over a set of topics (more affected by difficult topics)
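The following is a minimal Python sketch (not from the talk) of non-interpolated average precision and MAP as defined above; the ranking and relevance judgments reuse the worked example from the Precision-Recall Curve slide below, with "D7" standing in for the fourth relevant document that is never retrieved.

def average_precision(ranked_docs, relevant):
    # Precision at the rank of each retrieved relevant document,
    # averaged over ALL relevant documents (unretrieved ones count as 0).
    hits = 0
    precisions = []
    for rank, doc in enumerate(ranked_docs, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(relevant) if relevant else 0.0

def mean_average_precision(topics):
    # MAP: arithmetic mean of average precision over a set of topics;
    # `topics` is a list of (ranked_docs, relevant_set) pairs.
    return sum(average_precision(r, rel) for r, rel in topics) / len(topics)

ranking = ["D1", "D2", "D3", "D4", "D5", "D6"]   # system returns 6 docs
relevant = {"D1", "D2", "D5", "D7"}              # 4 relevant docs; D7 never retrieved
print(average_precision(ranking, relevant))      # (1/1 + 2/2 + 3/5 + 0) / 4 = 0.65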
Precision-Recall Curve
[Figure: evaluation output annotated with the following]
• Precision-recall curve with the breakeven point (prec = recall) marked
• Mean Avg. Precision (MAP)
• Recall = 3212/4728 (out of 4728 rel docs, we’ve got 3212)
• Precision@10docs (about 5.5 docs in the top 10 docs are relevant)
• Worked example: ranking D1 +, D2 +, D3 -, D4 -, D5 +, D6 -; system returns 6 docs; total # rel docs = 4; Average Prec = (1/1 + 2/2 + 3/5 + 0)/4
Typical TR System Architecture
[Diagram: typical TR system architecture. The user issues a query and receives results; a Tokenizer and Indexer build the Doc Rep (Index) from the docs; a Query Rep is built from the query; the Scorer matches the Query Rep against the Index to produce the results; user judgments drive a Feedback loop back into the query representation.]
Tokenization
• Normalize lexical units: Words with similar meanings should be mapped to the same indexing term
• Stemming: Mapping all inflectional forms of words to the same root form, e.g.
– computer -> compute
– computation -> compute
– computing -> compute (but king->k?)
• Porter’s Stemmer is popular for English
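As a small illustration of stemming in code (a sketch assuming NLTK and its PorterStemmer are available; any Porter implementation would do):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["computer", "computation", "computing", "king"]:
    print(word, "->", stemmer.stem(word))
# The first three all map to the same root ("comput", slightly different from
# the idealized "compute" above), while "king" is left unchanged.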
Relevance Feedback
[Diagram: relevance feedback loop. The Query goes to the Retrieval Engine, which searches the Document collection and returns Results (d1 3.5, d2 2.4, …, dk 0.5, …); the User judges them (d1 +, d2 -, d3 +, …, dk -, …); the Feedback step uses these judgments to produce an Updated query.]
Pseudo/Blind/Automatic Feedback
[Diagram: same loop as relevance feedback, but the top 10 results returned by the Retrieval Engine are simply assumed to be relevant (d1 +, d2 +, d3 +, …, dk -, …) instead of being judged by the user; Feedback on these pseudo-judgments produces the Updated query.]
Traditional approach = Vector space model
Vector Space Model
• Represent a doc/query by a term vector
– Term: basic concept, e.g., word or phrase
– Each term defines one dimension
– N terms define a high-dimensional space
– Element of vector corresponds to term weight
– E.g., d=(x1,…,xN), xi is “importance” of term i
• Measure relevance by the distance between the query vector and document vector in the vector space
VS Model: illustration
[Figure: a 3-dimensional term space with axes Java, Microsoft, and Starbucks. Documents D1–D11 and the Query are plotted as points; the relevance of D1, D2, and D3 to the Query is marked with question marks.]
What’s a good “basic concept”?
• Orthogonal
– Linearly independent basis vectors
– “Non-overlapping” in meaning
• No ambiguity
• Weights can be assigned automatically and hopefully accurately
• Many possibilities: Words, stemmed words, phrases, “latent concept”, …
How to Assign Weights?
• Very important!
• Why weighting?
– Query side: Not all terms are equally important
– Doc side: Some terms carry more content
• How?
– Two basic heuristics
• TF (Term Frequency) = Within-doc-frequency
• IDF (Inverse Document Frequency)
– TF normalization
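A minimal Python sketch of TF-IDF weighting and cosine scoring in the vector space model; the idf formula (log N/df) and the toy documents are illustrative choices, not necessarily those used in the talk.

import math
from collections import Counter

def tf_idf_vector(tokens, df, n_docs):
    # Weight = term frequency in the doc/query  x  idf = log(N / df).
    tf = Counter(tokens)
    return {t: f * math.log(n_docs / df[t]) for t, f in tf.items() if t in df}

def cosine(v1, v2):
    # Cosine similarity between two sparse term-weight vectors.
    dot = sum(w * v2.get(t, 0.0) for t, w in v1.items())
    n1 = math.sqrt(sum(w * w for w in v1.values()))
    n2 = math.sqrt(sum(w * w for w in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

docs = [["java", "programming"], ["starbucks", "coffee", "java"], ["microsoft", "windows"]]
df = Counter(t for d in docs for t in set(d))          # document frequencies
query_vec = tf_idf_vector(["java", "coffee"], df, len(docs))
scores = [cosine(query_vec, tf_idf_vector(d, df, len(docs))) for d in docs]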
Language Modeling Approaches are becoming more and more popular…
What is a Statistical LM?
• A probability distribution over word sequences
– p(“Today is Wednesday”) ≈ 0.001
– p(“Today Wednesday is”) ≈ 0.0000000000001
– p(“The eigenvalue is positive”) ≈ 0.00001
• Context-dependent!
• Can also be regarded as a probabilistic mechanism for “generating” text, thus also called a “generative” model
The Simplest Language Model (Unigram Model)
• Generate a piece of text by generating each word INDEPENDENTLY
• Thus, p(w1 w2 ... wn)=p(w1)p(w2)…p(wn)
• Parameters: {p(wi)}, with p(w1)+…+p(wN)=1 (N is voc. size)
• Essentially a multinomial distribution over words
• A piece of text can be regarded as a sample drawn according to this word distribution
Text Generation with Unigram LM
[Diagram: a (Unigram) Language Model p(w|θ) generates a Document by Sampling.]
• Topic 1: Text mining — text 0.2, mining 0.1, association 0.01, clustering 0.02, …, food 0.00001, … → generates a text mining paper
• Topic 2: Health — food 0.25, nutrition 0.1, healthy 0.05, diet 0.02, … → generates a food nutrition paper
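A tiny Python sketch of this generation view: each word is drawn independently from the unigram distribution. The probabilities are the illustrative values from the slide (a real model would cover the whole vocabulary and sum to 1).

import random

text_mining_lm = {"text": 0.2, "mining": 0.1, "association": 0.01,
                  "clustering": 0.02, "food": 0.00001}

def sample_text(lm, length):
    # Draw `length` words independently from the unigram distribution.
    words, weights = zip(*lm.items())
    return random.choices(words, weights=weights, k=length)

print(" ".join(sample_text(text_mining_lm, 10)))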
Estimation of Unigram LM
[Diagram: a Document is used to estimate the (Unigram) Language Model p(w|θ) = ?]
• Document: a “text mining paper” (total #words = 100) with counts text 10, mining 5, association 3, database 3, algorithm 2, …, query 1, efficient 1, …
• Estimated model: text 10/100, mining 5/100, association 3/100, database 3/100, …, query 1/100, …
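In code, the estimate above is just relative counts; a minimal sketch mirroring the slide’s numbers (a 100-word document with “text” occurring 10 times, “mining” 5 times, …; “other” stands in for the remaining words):

from collections import Counter

def mle_unigram(doc_tokens):
    # p(w|d) = c(w, d) / |d|
    counts = Counter(doc_tokens)
    total = len(doc_tokens)
    return {w: c / total for w, c in counts.items()}

doc = (["text"] * 10 + ["mining"] * 5 + ["association"] * 3 +
       ["database"] * 3 + ["algorithm"] * 2 + ["other"] * 77)   # 100 words in total
model = mle_unigram(doc)
print(model["text"], model["mining"])   # 0.1  0.05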
Language Models for Retrieval(Ponte & Croft 98)
[Diagram: each Document is associated with its own Language Model.]
• Text mining paper → model with text ?, mining ?, association ?, clustering ?, …, food ?, …
• Food nutrition paper → model with food ?, nutrition ?, healthy ?, diet ?, …
• Query = “data mining algorithms”
• Which model would most likely have generated this query?
Ranking Docs by Query Likelihood
[Diagram: each document d1, d2, …, dN in the collection has its own document LM θd1, θd2, …, θdN; given the query q, documents are ranked by the query likelihood p(q|θd1), p(q|θd2), …, p(q|θdN).]
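A minimal Python sketch of query-likelihood ranking. Pure MLE document models give zero probability to query words absent from a document, so this sketch interpolates with a collection model p(w|C), anticipating the Jelinek-Mercer smoothing introduced two slides later; the value of lambda is an arbitrary choice.

import math
from collections import Counter

def query_log_likelihood(query, doc_tokens, collection_lm, lam=0.5):
    # log p(q | theta_d) with the doc model smoothed by the collection model.
    counts = Counter(doc_tokens)
    dlen = len(doc_tokens)
    score = 0.0
    for w in query:
        p_ml = counts[w] / dlen if dlen else 0.0
        p = (1 - lam) * p_ml + lam * collection_lm.get(w, 1e-9)  # 1e-9: floor for OOV words
        score += math.log(p)
    return score

def rank_by_query_likelihood(query, docs, collection_lm):
    # Rank documents by log p(q | theta_d), highest first.
    return sorted(docs, key=lambda d: query_log_likelihood(query, d, collection_lm),
                  reverse=True)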
Kullback-Leibler (KL) Divergence Retrieval Model
• Unigram similarity model
• Retrieval = estimation of θQ and θD
• Score:
Sim(d;q) = -D(θQ || θD) = Σw p(w|θQ) log p(w|θD) - Σw p(w|θQ) log p(w|θQ)
(the second sum is the query entropy, ignored for ranking)
• Special case: θQ = empirical distribution of q (reduces to query likelihood)
• With a smoothed document model, only the words seen in d need to be scored:
sim(q;d) ∝ Σ{wi in d, p(wi|θQ) > 0} p(wi|θQ) log [ pseen(wi|d) / (αd p(wi|C)) ] + log αd
where pseen(wi|d) is the smoothed probability of a word seen in d and αd is the coefficient controlling the probability mass given to unseen words.
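For reference, the ranking part of the KL-divergence score is a one-liner once both models are available as word-to-probability maps; a sketch assuming the document model has already been smoothed:

import math

def kl_score(query_model, doc_model):
    # sum_w p(w|theta_Q) * log p(w|theta_D): equals -D(theta_Q || theta_D)
    # up to the query entropy, which does not affect the ranking.
    return sum(p_q * math.log(doc_model.get(w, 1e-12))   # tiny floor if a word is missing
               for w, p_q in query_model.items() if p_q > 0)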
Estimating p(w|d) (i.e., θD)
• Simplified Jelinek-Mercer: shrink uniformly toward p(w|C)
p(w|d) = (1-λ) pml(w|d) + λ p(w|C)
• Dirichlet prior (Bayesian): assume μ pseudo counts of p(w|C)
p(w|d) = (c(w;d) + μ p(w|C)) / (|d| + μ) = (|d|/(|d|+μ)) pml(w|d) + (μ/(|d|+μ)) p(w|C)
• Absolute discounting: subtract a constant δ
p(w|d) = (max(c(w;d) - δ, 0) + δ |d|u p(w|C)) / |d|
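The three smoothing formulas above translate directly into code; a sketch assuming doc_counts is a collections.Counter of the document’s words (so unseen words count as 0) and p_wC is the collection probability of w:

def jelinek_mercer(w, doc_counts, dlen, p_wC, lam):
    # p(w|d) = (1 - lambda) * p_ml(w|d) + lambda * p(w|C)
    return (1 - lam) * (doc_counts[w] / dlen) + lam * p_wC

def dirichlet(w, doc_counts, dlen, p_wC, mu):
    # p(w|d) = (c(w;d) + mu * p(w|C)) / (|d| + mu)
    return (doc_counts[w] + mu * p_wC) / (dlen + mu)

def absolute_discount(w, doc_counts, dlen, p_wC, delta):
    # p(w|d) = (max(c(w;d) - delta, 0) + delta * |d|_u * p(w|C)) / |d|,
    # where |d|_u is the number of unique words in the document.
    return (max(doc_counts[w] - delta, 0) + delta * len(doc_counts) * p_wC) / dlen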
Estimating θQ (Feedback)
[Diagram: the Query Q gives a query model θQ, the Document D gives a document model θD, and documents are scored by -D(θQ || θD) to produce the Results; the top results become Feedback Docs F = {d1, d2, …, dn}, from which a feedback model θF is estimated with a generative model.]
• Interpolate the original query model with the feedback model: θQ' = (1-α) θQ + α θF
– α = 0: no feedback, θQ' = θQ
– α = 1: full feedback, θQ' = θF
Generative Mixture Model
[Diagram: each word w of the feedback documents F = {d1, …, dn} is generated either from the topic model P(w|θ) (topic words, with probability 1-λ) or from the background model P(w|C) (background words, with probability λ); λ = noise in feedback documents.]
• Log-likelihood: log p(F|θ) = Σi Σw c(w; di) log[ (1-λ) p(w|θ) + λ p(w|C) ]
• Maximum likelihood: θF = argmaxθ log p(F|θ)
How to Estimate θF?
[Diagram: the observed feedback doc(s) are modeled as a mixture of
– the known background model p(w|C): the 0.2, a 0.1, we 0.01, to 0.02, …, text 0.0001, mining 0.00005, … (mixture weight λ = 0.7), and
– the unknown query-topic model p(w|θF) = ?: text ?, mining ?, association ?, word ?, … (the “Text mining” topic, mixture weight 1-λ = 0.3).]
Suppose we knew the identity of each word: then a simple ML estimator would suffice.
Can We Guess the Identity?
• Identity (“hidden”) variable: zi ∈ {1 (background), 0 (topic)}
• Example: the/1 paper/1 presents/1 a/1 text/0 mining/0 algorithm/0 the/1 paper/0 ...
• Suppose the parameters are all known; what’s a reasonable guess of zi?
– depends on λ (why?)
– depends on p(w|C) and p(w|θF) (how?)
E-step:
p(zi = 1 | wi) = p(zi=1) p(wi|zi=1) / [ p(zi=1) p(wi|zi=1) + p(zi=0) p(wi|zi=0) ]
              = λ p(wi|C) / [ λ p(wi|C) + (1-λ) p(wi|θF) ]
M-step:
pnew(wi|θF) = c(wi, F) (1 - p(zi=1|wi)) / Σwj c(wj, F) (1 - p(zj=1|wj))
Initially, set p(w|θF) to some random value, then iterate …
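A minimal Python sketch of this EM procedure (lambda is fixed and only p(w|theta_F) is re-estimated; the lambda value and iteration count are arbitrary choices, and a uniform initialization is used instead of a random one for reproducibility):

from collections import Counter

def estimate_feedback_model(feedback_docs, collection_lm, lam=0.5, iters=30):
    counts = Counter(w for d in feedback_docs for w in d)      # c(w, F)
    vocab = list(counts)
    p_F = {w: 1.0 / len(vocab) for w in vocab}                 # initial topic model
    for _ in range(iters):
        # E-step: p(z=1|w) = lam*p(w|C) / (lam*p(w|C) + (1-lam)*p(w|theta_F))
        z1 = {w: (lam * collection_lm.get(w, 1e-9)) /
                 (lam * collection_lm.get(w, 1e-9) + (1 - lam) * p_F[w])
              for w in vocab}
        # M-step: p_new(w|theta_F) is proportional to c(w, F) * (1 - p(z=1|w))
        weights = {w: counts[w] * (1 - z1[w]) for w in vocab}
        total = sum(weights.values())
        p_F = {w: weights[w] / total for w in vocab}
    return p_F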
Example of Feedback Query Model
[Setup: the mixture-model approach applied to the top 10 docs returned from a Web database.]
TREC topic 412: “airport security”

λ = 0.9                          λ = 0.7
W               p(W|θF)          W               p(W|θF)
security        0.0558           the             0.0405
airport         0.0546           security        0.0377
beverage        0.0488           airport         0.0342
alcohol         0.0474           beverage        0.0305
bomb            0.0236           alcohol         0.0304
terrorist       0.0217           to              0.0268
author          0.0206           of              0.0241
license         0.0188           and             0.0214
bond            0.0186           author          0.0156
counter-terror  0.0173           bomb            0.0150
terror          0.0142           terrorist       0.0137
newsnet         0.0129           in              0.0135
attack          0.0124           license         0.0127
operation       0.0121           state           0.0127
headline        0.0121           by              0.0125
Problem with Standard IR Methods:Semi-Structured Queries
• TREC-2003 Genomics Track, Topic 1:
Find articles about the following gene:
OFFICIAL_GENE_NAME activating transcription factor 2
OFFICIAL_SYMBOL ATF2
ALIAS_SYMBOL HB16
ALIAS_SYMBOL CREB2
ALIAS_SYMBOL TREB7
ALIAS_SYMBOL CRE-BP1

Bag-of-word representation: activating transcription factor 2, ATF2, HB16, CREB2, TREB7, CRE-BP1

• Problems with unstructured representation
– Intuitively, matching “ATF2” should be counted more than matching “transcription”
– Such a query is not a natural sample of a unigram language model, violating the assumption of the language modeling retrieval approach
Problem with Standard IR Methods:Semi-Structured Queries (cont.)
• A topic in TREC-2005 Genomics Track
• 3 different fields
• Should be weighted differently?
• What about expansion?
Find information about the role of the gene interferon-beta in the disease multiple sclerosis
Semi-Structured Language Models
• Semi-structured query Q = (Q1, …, Qk)
• Semi-structured query model: a mixture of per-field models θ1, …, θk with weights α1, …, αk:
p(w|θQ) = Σ i=1..k αi p(w|θi)
Semi-structured LM estimation: Fit a mixture model to pseudo feedback documents using Expectation-Maximization (EM)
Parameter Estimation
• Synonym queries:
– Each field is estimated using maximum likelihood: p(w|θi) = c(w, Qi) / |Qi|
– Each field has equal weight: αi = 1/k
• Aspect queries:
– Use top-ranked documents to estimate all the parameters
– Similar to the single-aspect model, but use the query as a prior with Bayesian estimation
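A minimal Python sketch of the synonym-query case: each field gets a maximum likelihood model and the field models are mixed with equal weights alpha_i = 1/k (for aspect queries the weights and field models would instead be estimated from top-ranked documents, not shown here). The field list reuses Topic 1 from the TREC-2003 slide.

from collections import Counter

def field_model(field_tokens):
    # p(w|theta_i) = c(w, Q_i) / |Q_i|
    counts = Counter(field_tokens)
    return {w: c / len(field_tokens) for w, c in counts.items()}

def semi_structured_query_model(fields, weights=None):
    # p(w|theta_Q) = sum_i alpha_i * p(w|theta_i); default alpha_i = 1/k
    k = len(fields)
    weights = weights or [1.0 / k] * k
    model = {}
    for alpha, field in zip(weights, fields):
        for w, p in field_model(field).items():
            model[w] = model.get(w, 0.0) + alpha * p
    return model

fields = [["activating", "transcription", "factor", "2"],
          ["atf2"], ["hb16"], ["creb2"], ["treb7"], ["cre-bp1"]]
query_model = semi_structured_query_model(fields)
# "atf2" gets weight 1/6 while "transcription" gets only (1/6)*(1/4),
# matching the intuition that a symbol match should count more.

The resulting distribution can be plugged into the KL-divergence score sketched earlier as the query model.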
Maximum Likelihood vs. Bayesian
• Maximum likelihood estimation
– “Best” means “data likelihood reaches maximum”
– Problem: small sample
• Bayesian estimation
– “Best” means being consistent with our “prior” knowledge and explaining data well
– Problem: how to define prior?
Maximum likelihood: θ̂ = argmaxθ p(X|θ)
Bayesian (posterior mode): θ̂ = argmaxθ p(θ|X) = argmaxθ p(X|θ) p(θ)
Illustration of Bayesian Estimation
Prior: p(θ)
Likelihood: p(X|θ), X = (x1, …, xN)
Posterior: p(θ|X) ∝ p(X|θ) p(θ)
[Figure: curves for the prior, likelihood, and posterior over θ, marking the prior mode, the ML estimate θml, and the posterior mode.]
Experiment Results
              TREC 2003 (Uniform weights)      TREC 2005 (Estimated weights)
Query Model   Unstruct  Semi-struct  Imp.      Unstruct  Semi-struct  Imp.
MAP           0.16      0.185        +13.5%    0.242     0.258        +6.6%
Pr@10docs     0.14      0.154        +10%      0.382     0.412        +7.8%
More Experiment Results (with slightly different model)
Conclusions
• Standard IR techniques are effective for biomedical literature retrieval
• Modeling and exploiting the structure in a query can improve accuracy
• Overall TREC Genomics Track findings
– Domain-specific resources are very useful
– Sound retrieval models and machine learning techniques are helpful
Future Work
• Using HMMs to model relevant documents
• Incorporate biomedical resources into principled statistical models