TRANSCRIPT
Statistical Language Models for
Biomedical Literature Retrieval
ChengXiang Zhai
Department of Computer Science,
Institute for Genomic Biology,
and Graduate School of Library & Information Science
University of Illinois, Urbana-Champaign
Motivation
• Biomedical literature serves as a “complete” documentation of the biomedical knowledge discovered by scientists
• Medline: > 10,000,000 literature abstracts (1966-)
• Effective access to biomedical literature is essential for
– Understanding related existing discoveries
– Formulating new hypotheses
– Verifying hypotheses
– …
• Biologists routinely use PubMed to access literature (http://www.ncbi.nlm.nih.gov/PubMed)
Challenges in Biomedical Literature Retrieval
• Tokenization
– Many names are irregular with special characters such as “/”, “-”, etc. E.g., MIP-1-alpha, (MIP)-1alpha
– Ambiguous words: “was” and “as” can be genes
• Semi-structured queries
– It is often desirable to expand a query about a gene with synonyms of the gene; the expanded query would have several fields (original name + symbols)
– “Find the role of gene A in disease B” (3 fields)
• …
TREC Genomics Track
• TREC (Text REtrieval Conference):
– Started 1992; sponsored by NIST
– Large-scale evaluation of information retrieval (IR) techniques
• Genomics Track
– Started in 2003
– Still continuing
– Evaluation of IR for biomedical literature search
Typical TREC Cycle
• Feb: Application for participation
• Spring: Preliminary (training) data available
• Beginning of Summer: Official test data available
• End of Summer: Result submission
• Early Fall: Official evaluation; results are out in Oct
• Nov: TREC Workshop; plan for next year
UIUC Participation
• 2003: Obtained initial experience; recognized the problem of “semi-structured queries”
• 2005: Continued developing semi-structured language models
• 2006: Applied hidden Markov models to passage retrieval
Outline
• Standard IR Techniques
• Semi-structured Query Language Models
• Parameter Estimation
• Experiment Results
• Conclusions and Future Work
What is Text Retrieval (TR)?
• There exists a collection of text documents
• User gives a query to express the information need
• A retrieval system returns relevant documents to users
• More commonly known as “Information Retrieval” (IR)
• Known as “search technology” in industry
TR is Hard!
• Under/over-specified query
– Ambiguous: “buying CDs” (money or music?)
– Incomplete: what kind of CDs?
– What if “CD” is never mentioned in document?
• Vague semantics of documents
– Ambiguity: e.g., word-sense, structural
– Incomplete: Inferences required
• Even hard for people!
– 80% agreement in human judgments
TR is “Easy”!
• TR CAN be easy in a particular case
– Ambiguity in query/document is RELATIVE to the database
– So, if the query is SPECIFIC enough, just one keyword may get all the relevant documents
• PERCEIVED TR performance is usually better than the actual performance
– Users can NOT judge the completeness of an answer
Formal Formulation of TR
• Vocabulary V={w1, w2, …, wN} of language
• Query q = q1,…,qm, where qi ∈ V
• Document di = di1,…,dimi, where dij ∈ V
• Collection C = {d1, …, dk}
• Set of relevant documents R(q) ⊆ C
– Generally unknown and user-dependent
– Query is a “hint” on which doc is in R(q)
• Task = compute R’(q), an “approximate R(q)”
Computing R(q)
• Strategy 1: Document selection
– R(q) = {d ∈ C | f(d,q) = 1}, where f(d,q) ∈ {0,1} is an indicator function or classifier
– System must decide if a doc is relevant or not (“absolute relevance”)
• Strategy 2: Document ranking
– R(q) = {d ∈ C | f(d,q) > θ}, where f(d,q) is a relevance measure function; θ is a cutoff
– System must decide if one doc is more likely to be relevant than another (“relative relevance”)
Document Selection vs. Ranking
[Figure: document selection vs. document ranking. Selection uses a binary classifier f(d,q) ∈ {0,1} that splits the collection into an accepted set R’(q) and a rejected set, shown against the true R(q). Ranking uses a scoring function f(d,q), e.g. d1 0.98 (+), d2 0.95 (+), d3 0.83 (-), d4 0.80 (+), d5 0.76 (-), d6 0.56 (-), d7 0.34 (-), d8 0.21 (+), d9 0.21 (-); the user decides how far down the ranked list to go, which implicitly defines R’(q).]
Problems of Doc Selection
• The classifier is unlikely accurate
– “Over-constrained” query (terms are too specific): no relevant documents found
– “Under-constrained” query (terms are too general): over delivery
– It is extremely hard to find the right position between these two extremes
• Even if it is accurate, all relevant documents are not equally relevant
• Relevance is a matter of degree!
Ranking is often preferred
• Relevance is a matter of degree
• A user can stop browsing anywhere, so the boundary is controlled by the user
– High recall users would view more items
– High precision users would view only a few
• Theoretical justification: Probability Ranking Principle [Robertson 77]
Evaluation Criteria
• Effectiveness/Accuracy
– Precision, Recall
• Efficiency
– Space and time complexity
• Usability
– How useful for real user tasks?
Methodology: Cranfield Tradition
• Laboratory testing of system components
– Precision, Recall
– Comparative testing
• Test collections
– Set of documents
– Set of questions
– Relevance judgments
The Contingency Table
Doc \ Action      Retrieved              Not Retrieved
Relevant          Relevant Retrieved     Relevant Rejected
Not relevant      Irrelevant Retrieved   Irrelevant Rejected

Recall = |Relevant ∩ Retrieved| / |Relevant|
Precision = |Relevant ∩ Retrieved| / |Retrieved|
How to measure a ranking?
• Compute the precision at every recall point
• Plot a precision-recall (PR) curve
[Figure: two precision-recall curves (precision vs. recall), each drawn through four measured points, with the question: which is better?]
Summarize a Ranking
• Given that n docs are retrieved
– Compute the precision (at rank) where each (new) relevant document is retrieved => p(1),…,p(k), if we have k rel. docs
– E.g., if the first rel. doc is at the 2nd rank, then p(1)=1/2.
– If a relevant document never gets retrieved, we assume the precision corresponding to that rel. doc to be zero
• Compute the average over all the relevant documents
– Average precision = (p(1)+…+p(k))/k
• This gives us (non-interpolated) average precision, which captures both precision and recall and is sensitive to the rank of each relevant document
• Mean Average Precision (MAP)
– MAP = arithmetic mean of average precision over a set of topics
– gMAP = geometric mean average precision over a set of topics (more affected by difficult topics)
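The following is a minimal Python sketch (not from the talk) of non-interpolated average precision and MAP as defined above; the ranking and relevance judgments reuse the worked example from the Precision-Recall Curve slide below, with "D7" standing in for the fourth relevant document that is never retrieved.

def average_precision(ranked_docs, relevant):
    # Precision at the rank of each retrieved relevant document,
    # averaged over ALL relevant documents (unretrieved ones count as 0).
    hits = 0
    precisions = []
    for rank, doc in enumerate(ranked_docs, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(relevant) if relevant else 0.0

def mean_average_precision(topics):
    # MAP: arithmetic mean of average precision over a set of topics;
    # `topics` is a list of (ranked_docs, relevant_set) pairs.
    return sum(average_precision(r, rel) for r, rel in topics) / len(topics)

ranking = ["D1", "D2", "D3", "D4", "D5", "D6"]   # system returns 6 docs
relevant = {"D1", "D2", "D5", "D7"}              # 4 relevant docs; D7 never retrieved
print(average_precision(ranking, relevant))      # (1/1 + 2/2 + 3/5 + 0) / 4 = 0.65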
Precision-Recall Curve
[Figure: evaluation output annotated with the following]
• Precision-recall curve with the breakeven point (prec = recall) marked
• Mean Avg. Precision (MAP)
• Recall = 3212/4728 (out of 4728 rel docs, we’ve got 3212)
• Precision@10docs (about 5.5 docs in the top 10 docs are relevant)
• Worked example: ranking D1 +, D2 +, D3 -, D4 -, D5 +, D6 -; system returns 6 docs; total # rel docs = 4; Average Prec = (1/1 + 2/2 + 3/5 + 0)/4
Typical TR System Architecture
[Diagram: typical TR system architecture. The user issues a query and receives results; a Tokenizer and Indexer build the Doc Rep (Index) from the docs; a Query Rep is built from the query; the Scorer matches the Query Rep against the Index to produce the results; user judgments drive a Feedback loop back into the query representation.]
Tokenization
• Normalize lexical units: Words with similar meanings should be mapped to the same indexing term
• Stemming: Mapping all inflectional forms of words to the same root form, e.g.
– computer -> compute
– computation -> compute
– computing -> compute (but king->k?)
• Porter’s Stemmer is popular for English
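As a small illustration of stemming in code (a sketch assuming NLTK and its PorterStemmer are available; any Porter implementation would do):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["computer", "computation", "computing", "king"]:
    print(word, "->", stemmer.stem(word))
# The first three all map to the same root ("comput", slightly different from
# the idealized "compute" above), while "king" is left unchanged.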
Relevance Feedback
[Diagram: relevance feedback loop. The Query goes to the Retrieval Engine, which searches the Document collection and returns Results (d1 3.5, d2 2.4, …, dk 0.5, …); the User judges them (d1 +, d2 -, d3 +, …, dk -, …); the Feedback step uses these judgments to produce an Updated query.]
Pseudo/Blind/Automatic Feedback
[Diagram: same loop as relevance feedback, but the top 10 results returned by the Retrieval Engine are simply assumed to be relevant (d1 +, d2 +, d3 +, …, dk -, …) instead of being judged by the user; Feedback on these pseudo-judgments produces the Updated query.]
Traditional approach = Vector space model
Vector Space Model
• Represent a doc/query by a term vector
– Term: basic concept, e.g., word or phrase
– Each term defines one dimension
– N terms define a high-dimensional space
– Element of vector corresponds to term weight
– E.g., d=(x1,…,xN), xi is “importance” of term i
• Measure relevance by the distance between the query vector and document vector in the vector space
VS Model: illustration
[Figure: a 3-dimensional term space with axes Java, Microsoft, and Starbucks. Documents D1–D11 and the Query are plotted as points; the relevance of D1, D2, and D3 to the Query is marked with question marks.]
What’s a good “basic concept”?
• Orthogonal
– Linearly independent basis vectors
– “Non-overlapping” in meaning
• No ambiguity
• Weights can be assigned automatically and hopefully accurately
• Many possibilities: Words, stemmed words, phrases, “latent concept”, …
How to Assign Weights?
• Very important!
• Why weighting?
– Query side: Not all terms are equally important
– Doc side: Some terms carry more content
• How?
– Two basic heuristics
• TF (Term Frequency) = Within-doc-frequency
• IDF (Inverse Document Frequency)
– TF normalization
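A minimal Python sketch of TF-IDF weighting and cosine scoring in the vector space model; the idf formula (log N/df) and the toy documents are illustrative choices, not necessarily those used in the talk.

import math
from collections import Counter

def tf_idf_vector(tokens, df, n_docs):
    # Weight = term frequency in the doc/query  x  idf = log(N / df).
    tf = Counter(tokens)
    return {t: f * math.log(n_docs / df[t]) for t, f in tf.items() if t in df}

def cosine(v1, v2):
    # Cosine similarity between two sparse term-weight vectors.
    dot = sum(w * v2.get(t, 0.0) for t, w in v1.items())
    n1 = math.sqrt(sum(w * w for w in v1.values()))
    n2 = math.sqrt(sum(w * w for w in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

docs = [["java", "programming"], ["starbucks", "coffee", "java"], ["microsoft", "windows"]]
df = Counter(t for d in docs for t in set(d))          # document frequencies
query_vec = tf_idf_vector(["java", "coffee"], df, len(docs))
scores = [cosine(query_vec, tf_idf_vector(d, df, len(docs))) for d in docs]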
Language Modeling Approaches are becoming more and more popular…
What is a Statistical LM?
• A probability distribution over word sequences
– p(“Today is Wednesday”) ≈ 0.001
– p(“Today Wednesday is”) ≈ 0.0000000000001
– p(“The eigenvalue is positive”) ≈ 0.00001
• Context-dependent!
• Can also be regarded as a probabilistic mechanism for “generating” text, thus also called a “generative” model
The Simplest Language Model (Unigram Model)
• Generate a piece of text by generating each word INDEPENDENTLY
• Thus, p(w1 w2 ... wn)=p(w1)p(w2)…p(wn)
• Parameters: {p(wi)}, with p(w1)+…+p(wN)=1 (N is voc. size)
• Essentially a multinomial distribution over words
• A piece of text can be regarded as a sample drawn according to this word distribution
Text Generation with Unigram LM
[Diagram: a (Unigram) Language Model p(w|θ) generates a Document by Sampling.]
• Topic 1: Text mining — text 0.2, mining 0.1, association 0.01, clustering 0.02, …, food 0.00001, … → generates a text mining paper
• Topic 2: Health — food 0.25, nutrition 0.1, healthy 0.05, diet 0.02, … → generates a food nutrition paper
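A tiny Python sketch of this generation view: each word is drawn independently from the unigram distribution. The probabilities are the illustrative values from the slide (a real model would cover the whole vocabulary and sum to 1).

import random

text_mining_lm = {"text": 0.2, "mining": 0.1, "association": 0.01,
                  "clustering": 0.02, "food": 0.00001}

def sample_text(lm, length):
    # Draw `length` words independently from the unigram distribution.
    words, weights = zip(*lm.items())
    return random.choices(words, weights=weights, k=length)

print(" ".join(sample_text(text_mining_lm, 10)))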
Estimation of Unigram LM
[Diagram: a Document is used to estimate the (Unigram) Language Model p(w|θ) = ?]
• Document: a “text mining paper” (total #words = 100) with counts text 10, mining 5, association 3, database 3, algorithm 2, …, query 1, efficient 1, …
• Estimated model: text 10/100, mining 5/100, association 3/100, database 3/100, …, query 1/100, …
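In code, the estimate above is just relative counts; a minimal sketch mirroring the slide’s numbers (a 100-word document with “text” occurring 10 times, “mining” 5 times, …; “other” stands in for the remaining words):

from collections import Counter

def mle_unigram(doc_tokens):
    # p(w|d) = c(w, d) / |d|
    counts = Counter(doc_tokens)
    total = len(doc_tokens)
    return {w: c / total for w, c in counts.items()}

doc = (["text"] * 10 + ["mining"] * 5 + ["association"] * 3 +
       ["database"] * 3 + ["algorithm"] * 2 + ["other"] * 77)   # 100 words in total
model = mle_unigram(doc)
print(model["text"], model["mining"])   # 0.1  0.05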
Language Models for Retrieval(Ponte & Croft 98)
[Diagram: each Document is associated with its own Language Model.]
• Text mining paper → model with text ?, mining ?, association ?, clustering ?, …, food ?, …
• Food nutrition paper → model with food ?, nutrition ?, healthy ?, diet ?, …
• Query = “data mining algorithms”
• Which model would most likely have generated this query?
Ranking Docs by Query Likelihood
[Diagram: each document d1, d2, …, dN in the collection has its own document LM θd1, θd2, …, θdN; given the query q, documents are ranked by the query likelihood p(q|θd1), p(q|θd2), …, p(q|θdN).]
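A minimal Python sketch of query-likelihood ranking. Pure MLE document models give zero probability to query words absent from a document, so this sketch interpolates with a collection model p(w|C), anticipating the Jelinek-Mercer smoothing introduced two slides later; the value of lambda is an arbitrary choice.

import math
from collections import Counter

def query_log_likelihood(query, doc_tokens, collection_lm, lam=0.5):
    # log p(q | theta_d) with the doc model smoothed by the collection model.
    counts = Counter(doc_tokens)
    dlen = len(doc_tokens)
    score = 0.0
    for w in query:
        p_ml = counts[w] / dlen if dlen else 0.0
        p = (1 - lam) * p_ml + lam * collection_lm.get(w, 1e-9)  # 1e-9: floor for OOV words
        score += math.log(p)
    return score

def rank_by_query_likelihood(query, docs, collection_lm):
    # Rank documents by log p(q | theta_d), highest first.
    return sorted(docs, key=lambda d: query_log_likelihood(query, d, collection_lm),
                  reverse=True)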
Kullback-Leibler (KL) Divergence Retrieval Model
• Unigram similarity model
• Retrieval = estimation of θQ and θD
• Score:
Sim(d;q) = -D(θQ || θD) = Σw p(w|θQ) log p(w|θD) - Σw p(w|θQ) log p(w|θQ)
(the second sum is the query entropy, ignored for ranking)
• Special case: θQ = empirical distribution of q (reduces to query likelihood)
• With a smoothed document model, only the words seen in d need to be scored:
sim(q;d) ∝ Σ{wi in d, p(wi|θQ) > 0} p(wi|θQ) log [ pseen(wi|d) / (αd p(wi|C)) ] + log αd
where pseen(wi|d) is the smoothed probability of a word seen in d and αd is the coefficient controlling the probability mass given to unseen words.
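For reference, the ranking part of the KL-divergence score is a one-liner once both models are available as word-to-probability maps; a sketch assuming the document model has already been smoothed:

import math

def kl_score(query_model, doc_model):
    # sum_w p(w|theta_Q) * log p(w|theta_D): equals -D(theta_Q || theta_D)
    # up to the query entropy, which does not affect the ranking.
    return sum(p_q * math.log(doc_model.get(w, 1e-12))   # tiny floor if a word is missing
               for w, p_q in query_model.items() if p_q > 0)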
Estimating p(w|d) (i.e., θD)
• Simplified Jelinek-Mercer: shrink uniformly toward p(w|C)
p(w|d) = (1-λ) pml(w|d) + λ p(w|C)
• Dirichlet prior (Bayesian): assume μ pseudo counts of p(w|C)
p(w|d) = (c(w;d) + μ p(w|C)) / (|d| + μ) = (|d|/(|d|+μ)) pml(w|d) + (μ/(|d|+μ)) p(w|C)
• Absolute discounting: subtract a constant δ
p(w|d) = (max(c(w;d) - δ, 0) + δ |d|u p(w|C)) / |d|
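The three smoothing formulas above translate directly into code; a sketch assuming doc_counts is a collections.Counter of the document’s words (so unseen words count as 0) and p_wC is the collection probability of w:

def jelinek_mercer(w, doc_counts, dlen, p_wC, lam):
    # p(w|d) = (1 - lambda) * p_ml(w|d) + lambda * p(w|C)
    return (1 - lam) * (doc_counts[w] / dlen) + lam * p_wC

def dirichlet(w, doc_counts, dlen, p_wC, mu):
    # p(w|d) = (c(w;d) + mu * p(w|C)) / (|d| + mu)
    return (doc_counts[w] + mu * p_wC) / (dlen + mu)

def absolute_discount(w, doc_counts, dlen, p_wC, delta):
    # p(w|d) = (max(c(w;d) - delta, 0) + delta * |d|_u * p(w|C)) / |d|,
    # where |d|_u is the number of unique words in the document.
    return (max(doc_counts[w] - delta, 0) + delta * len(doc_counts) * p_wC) / dlen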
Estimating θQ (Feedback)
[Diagram: the Query Q gives a query model θQ, the Document D gives a document model θD, and documents are scored by -D(θQ || θD) to produce the Results; the top results become Feedback Docs F = {d1, d2, …, dn}, from which a feedback model θF is estimated with a generative model.]
• Interpolate the original query model with the feedback model: θQ' = (1-α) θQ + α θF
– α = 0: no feedback, θQ' = θQ
– α = 1: full feedback, θQ' = θF
Generative Mixture Model
[Diagram: each word w of the feedback documents F = {d1, …, dn} is generated either from the topic model P(w|θ) (topic words, with probability 1-λ) or from the background model P(w|C) (background words, with probability λ); λ = noise in feedback documents.]
• Log-likelihood: log p(F|θ) = Σi Σw c(w; di) log[ (1-λ) p(w|θ) + λ p(w|C) ]
• Maximum likelihood: θF = argmaxθ log p(F|θ)
How to Estimate θF?
[Diagram: the observed feedback doc(s) are modeled as a mixture of
– the known background model p(w|C): the 0.2, a 0.1, we 0.01, to 0.02, …, text 0.0001, mining 0.00005, … (mixture weight λ = 0.7), and
– the unknown query-topic model p(w|θF) = ?: text ?, mining ?, association ?, word ?, … (the “Text mining” topic, mixture weight 1-λ = 0.3).]
Suppose we knew the identity of each word: then a simple ML estimator would suffice.
Can We Guess the Identity?
• Identity (“hidden”) variable: zi ∈ {1 (background), 0 (topic)}
• Example: the/1 paper/1 presents/1 a/1 text/0 mining/0 algorithm/0 the/1 paper/0 ...
• Suppose the parameters are all known; what’s a reasonable guess of zi?
– depends on λ (why?)
– depends on p(w|C) and p(w|θF) (how?)
E-step:
p(zi = 1 | wi) = p(zi=1) p(wi|zi=1) / [ p(zi=1) p(wi|zi=1) + p(zi=0) p(wi|zi=0) ]
              = λ p(wi|C) / [ λ p(wi|C) + (1-λ) p(wi|θF) ]
M-step:
pnew(wi|θF) = c(wi, F) (1 - p(zi=1|wi)) / Σwj c(wj, F) (1 - p(zj=1|wj))
Initially, set p(w|θF) to some random value, then iterate …
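A minimal Python sketch of this EM procedure (lambda is fixed and only p(w|theta_F) is re-estimated; the lambda value and iteration count are arbitrary choices, and a uniform initialization is used instead of a random one for reproducibility):

from collections import Counter

def estimate_feedback_model(feedback_docs, collection_lm, lam=0.5, iters=30):
    counts = Counter(w for d in feedback_docs for w in d)      # c(w, F)
    vocab = list(counts)
    p_F = {w: 1.0 / len(vocab) for w in vocab}                 # initial topic model
    for _ in range(iters):
        # E-step: p(z=1|w) = lam*p(w|C) / (lam*p(w|C) + (1-lam)*p(w|theta_F))
        z1 = {w: (lam * collection_lm.get(w, 1e-9)) /
                 (lam * collection_lm.get(w, 1e-9) + (1 - lam) * p_F[w])
              for w in vocab}
        # M-step: p_new(w|theta_F) is proportional to c(w, F) * (1 - p(z=1|w))
        weights = {w: counts[w] * (1 - z1[w]) for w in vocab}
        total = sum(weights.values())
        p_F = {w: weights[w] / total for w in vocab}
    return p_F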
Example of Feedback Query Model
[Setup: the mixture-model approach applied to the top 10 docs returned from a Web database.]
TREC topic 412: “airport security”

λ = 0.9                          λ = 0.7
W               p(W|θF)          W               p(W|θF)
security        0.0558           the             0.0405
airport         0.0546           security        0.0377
beverage        0.0488           airport         0.0342
alcohol         0.0474           beverage        0.0305
bomb            0.0236           alcohol         0.0304
terrorist       0.0217           to              0.0268
author          0.0206           of              0.0241
license         0.0188           and             0.0214
bond            0.0186           author          0.0156
counter-terror  0.0173           bomb            0.0150
terror          0.0142           terrorist       0.0137
newsnet         0.0129           in              0.0135
attack          0.0124           license         0.0127
operation       0.0121           state           0.0127
headline        0.0121           by              0.0125
Problem with Standard IR Methods:Semi-Structured Queries
• TREC-2003 Genomics Track, Topic 1:
Find articles about the following gene:
OFFICIAL_GENE_NAME activating transcription factor 2
OFFICIAL_SYMBOL ATF2
ALIAS_SYMBOL HB16
ALIAS_SYMBOL CREB2
ALIAS_SYMBOL TREB7
ALIAS_SYMBOL CRE-BP1

Bag-of-word representation: activating transcription factor 2, ATF2, HB16, CREB2, TREB7, CRE-BP1

• Problems with unstructured representation
– Intuitively, matching “ATF2” should be counted more than matching “transcription”
– Such a query is not a natural sample of a unigram language model, violating the assumption of the language modeling retrieval approach
Problem with Standard IR Methods:Semi-Structured Queries (cont.)
• A topic in TREC-2005 Genomics Track
• 3 different fields
• Should be weighted differently?
• What about expansion?
Find information about the role of the gene interferon-beta in the disease multiple sclerosis
Semi-Structured Language Models
• Semi-structured query Q = (Q1, …, Qk)
• Semi-structured query model: a mixture of per-field models θ1, …, θk with weights α1, …, αk:
p(w|θQ) = Σ i=1..k αi p(w|θi)
Semi-structured LM estimation: Fit a mixture model to pseudo feedback documents using Expectation-Maximization (EM)
Parameter Estimation
• Synonym queries:
– Each field is estimated using maximum likelihood: p(w|θi) = c(w, Qi) / |Qi|
– Each field has equal weight: αi = 1/k
• Aspect queries:
– Use top-ranked documents to estimate all the parameters
– Similar to the single-aspect model, but use the query as a prior with Bayesian estimation
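A minimal Python sketch of the synonym-query case: each field gets a maximum likelihood model and the field models are mixed with equal weights alpha_i = 1/k (for aspect queries the weights and field models would instead be estimated from top-ranked documents, not shown here). The field list reuses Topic 1 from the TREC-2003 slide.

from collections import Counter

def field_model(field_tokens):
    # p(w|theta_i) = c(w, Q_i) / |Q_i|
    counts = Counter(field_tokens)
    return {w: c / len(field_tokens) for w, c in counts.items()}

def semi_structured_query_model(fields, weights=None):
    # p(w|theta_Q) = sum_i alpha_i * p(w|theta_i); default alpha_i = 1/k
    k = len(fields)
    weights = weights or [1.0 / k] * k
    model = {}
    for alpha, field in zip(weights, fields):
        for w, p in field_model(field).items():
            model[w] = model.get(w, 0.0) + alpha * p
    return model

fields = [["activating", "transcription", "factor", "2"],
          ["atf2"], ["hb16"], ["creb2"], ["treb7"], ["cre-bp1"]]
query_model = semi_structured_query_model(fields)
# "atf2" gets weight 1/6 while "transcription" gets only (1/6)*(1/4),
# matching the intuition that a symbol match should count more.

The resulting distribution can be plugged into the KL-divergence score sketched earlier as the query model.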
Maximum Likelihood vs. Bayesian
• Maximum likelihood estimation
– “Best” means “data likelihood reaches maximum”
– Problem: small sample
• Bayesian estimation
– “Best” means being consistent with our “prior” knowledge and explaining data well
– Problem: how to define prior?
Maximum likelihood: θ̂ = argmaxθ p(X|θ)
Bayesian (posterior mode): θ̂ = argmaxθ p(θ|X) = argmaxθ p(X|θ) p(θ)
Illustration of Bayesian Estimation
Prior: p(θ)
Likelihood: p(X|θ), X = (x1, …, xN)
Posterior: p(θ|X) ∝ p(X|θ) p(θ)
[Figure: curves for the prior, likelihood, and posterior over θ, marking the prior mode, the ML estimate θml, and the posterior mode.]
Experiment Results
              TREC 2003 (Uniform weights)      TREC 2005 (Estimated weights)
Query Model   Unstruct  Semi-struct  Imp.      Unstruct  Semi-struct  Imp.
MAP           0.16      0.185        +13.5%    0.242     0.258        +6.6%
Pr@10docs     0.14      0.154        +10%      0.382     0.412        +7.8%
More Experiment Results (with slightly different model)
Conclusions
• Standard IR techniques are effective for biomedical literature retrieval
• Modeling and exploiting the structure in a query can improve accuracy
• Overall TREC Genomics Track findings
– Domain-specific resources are very useful
– Sound retrieval models and machine learning techniques are helpful
Future Work
• Using HMMs to model relevant documents
• Incorporate biomedical resources into principled statistical models