Statistical Language Models for Biomedical Literature Retrieval. ChengXiang Zhai, Department of Computer Science, Institute for Genomic Biology, and Graduate School of Library & Information Science, University of Illinois, Urbana-Champaign


Page 1

Statistical Language Models for

Biomedical Literature Retrieval

ChengXiang Zhai

Department of Computer Science,

Institute for Genomic Biology,

And Graduate School of Library & Information Science

University of Illinois, Urbana-Champaign

Page 2

Motivation

• Biomedical literature serves as a “complete” documentation of the biomedical knowledge discovered by scientists

• Medline: > 10,000,000 literature abstracts (1966-)

• Effective access to biomedical literature is essential for

– Understanding related existing discoveries

– Formulating new hypotheses

– Verifying hypotheses

– …

• Biologists routinely use PubMed to access literature (http://www.ncbi.nlm.nih.gov/PubMed)

Page 3

Challenges in Biomedical Literature Retrieval

• Tokenization

– Many names are irregular with special characters such as “/”, “-”, etc. E.g., MIP-1-alpha, (MIP)-1alpha

– Ambiguous words: “was” and “as” can be genes

• Semi-structured queries

– It is often desirable to expand a query about a gene with synonyms of the gene; the expanded query would have several fields (original name + symbols)

– “Find the role of gene A in disease B” (3 fields)

• …

Page 4

TREC Genomics Track

• TREC (Text REtrieval Conference):

– Started 1992; sponsored by NIST

– Large-scale evaluation of information retrieval (IR) techniques

• Genomics Track

– Started in 2003

– Still continuing

– Evaluation of IR for biomedical literature search

Page 5

Typical TREC Cycle

• Feb: Application for participation

• Spring: Preliminary (training) data available

• Beginning of Summer: Official test data available

• End of Summer: Result submission

• Early Fall: Official evaluation; results are out in Oct

• Nov: TREC Workshop; plan for next year

Page 6

UIUC Participation

• 2003: Obtained initial experience; recognized the problem of “semi-structured queries”

• 2005: Continued developing semi-structured language models

• 2006: Applied hidden Markov models to passage retrieval

Page 7

Outline

• Standard IR Techniques

• Semi-structured Query Language Models

• Parameter Estimation

• Experiment Results

• Conclusions and Future Work

Page 8

What is Text Retrieval (TR)?

• There exists a collection of text documents

• User gives a query to express the information need

• A retrieval system returns relevant documents to users

• More commonly known as “Information Retrieval” (IR)

• Known as “search technology” in industry

Page 9

TR is Hard!

• Under/over-specified query

– Ambiguous: “buying CDs” (money or music?)

– Incomplete: what kind of CDs?

– What if “CD” is never mentioned in document?

• Vague semantics of documents

– Ambiguity: e.g., word-sense, structural

– Incomplete: Inferences required

• Even hard for people!

– 80% agreement in human judgments

Page 10

TR is “Easy”!

• TR CAN be easy in a particular case

– Ambiguity in query/document is RELATIVE to the database

– So, if the query is SPECIFIC enough, just one keyword may get all the relevant documents

• PERCEIVED TR performance is usually better than the actual performance

– Users can NOT judge the completeness of an answer

Page 11

Formal Formulation of TR

• Vocabulary V={w1, w2, …, wN} of language

• Query q = q1,…,qm, where qi ∈ V

• Document di = di1,…,dimi, where dij ∈ V

• Collection C = {d1, …, dk}

• Set of relevant documents R(q) ⊆ C

– Generally unknown and user-dependent

– Query is a “hint” on which doc is in R(q)

• Task = compute R’(q), an “approximate R(q)”

Page 12

Computing R’(q)

• Strategy 1: Document selection

– R’(q) = {d ∈ C | f(d,q) = 1}, where f(d,q) ∈ {0,1} is an indicator function or classifier

– System must decide if a doc is relevant or not (“absolute relevance”)

• Strategy 2: Document ranking

– R’(q) = {d ∈ C | f(d,q) > θ}, where f(d,q) is a relevance measure function; θ is a cutoff

– System must decide if one doc is more likely to be relevant than another (“relative relevance”)
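
The two strategies can be contrasted in a few lines of Python. This is only a minimal sketch: the scores, the 0.9 classifier threshold, and the 0.8 cutoff are made-up values, and `select_documents` and `rank_documents` are hypothetical helper names, not part of any retrieval system.

```python
def select_documents(scores, classify):
    """Strategy 1: keep every document the binary classifier accepts."""
    return {doc for doc, score in scores.items() if classify(score) == 1}


def rank_documents(scores, cutoff):
    """Strategy 2: sort by score, then keep documents scoring above the cutoff."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [doc for doc, score in ranked if score > cutoff]


# Illustrative relevance scores f(d, q); the values are assumptions.
scores = {"d1": 0.98, "d2": 0.95, "d3": 0.83, "d4": 0.80, "d5": 0.76}

selected = select_documents(scores, lambda s: 1 if s >= 0.9 else 0)
ranked = rank_documents(scores, cutoff=0.8)
```

With ranking, moving the cutoff only changes how far down the list the user reads; with selection, the classifier's decision is final.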

Page 13

Document Selection vs. Ranking

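[Figure: relevant (+) and non-relevant (-) documents scattered in the collection, with the true R(q) outlined. Left panel, Doc Selection: f(d,q) = ? takes values 1 or 0 and the selected set is R’(q). Right panel, Doc Ranking: f(d,q) = ? gives a score per document, and R’(q) is the top of the ranked list.]

Example ranked list: 0.98 d1 +, 0.95 d2 +, 0.83 d3 -, 0.80 d4 +, 0.76 d5 -, 0.56 d6 -, 0.34 d7 -, 0.21 d8 +, 0.21 d9 -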

Page 14

Problems of Doc Selection

• The classifier is unlikely to be accurate

– “Over-constrained” query (terms are too specific): no relevant documents found

– “Under-constrained” query (terms are too general): over-delivery

– It is extremely hard to find the right position between these two extremes

• Even if it is accurate, not all relevant documents are equally relevant

• Relevance is a matter of degree!

Page 15

Ranking is often preferred

• Relevance is a matter of degree

• A user can stop browsing anywhere, so the boundary is controlled by the user

– High recall users would view more items

– High precision users would view only a few

• Theoretical justification: Probability Ranking Principle [Robertson 77]

Page 16

Evaluation Criteria

• Effectiveness/Accuracy

– Precision, Recall

• Efficiency

– Space and time complexity

• Usability

– How useful for real user tasks?

Page 17

Methodology: Cranfield Tradition

• Laboratory testing of system components

– Precision, Recall

– Comparative testing

• Test collections

– Set of documents

– Set of questions

– Relevance judgments

Page 18

The Contingency Table

Doc \ Action      Retrieved               Not Retrieved

Relevant          Relevant Retrieved      Relevant Rejected

Not relevant      Irrelevant Retrieved    Irrelevant Rejected

$$
\text{Precision} = \frac{|\text{Relevant Retrieved}|}{|\text{Retrieved}|}
\qquad
\text{Recall} = \frac{|\text{Relevant Retrieved}|}{|\text{Relevant}|}
$$
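
As a quick illustration of the two ratios, here is a small sketch that computes them from a retrieved set and a set of relevance judgments. The document IDs are invented, and `precision_recall` is a hypothetical helper name.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a retrieved set against relevance judgments."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # the "Relevant Retrieved" cell
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Toy example: 4 documents retrieved, 3 documents judged relevant.
precision, recall = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d7"])
```

Here two of the four retrieved documents are relevant (precision 0.5), and two of the three relevant documents were found (recall 2/3).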

Page 19

How to measure a ranking?

• Compute the precision at every recall point

• Plot a precision-recall (PR) curve

[Figure: two precision-recall curves, each plotted with recall on the x-axis and precision on the y-axis]

Which is better?

Page 20

Summarize a Ranking

• Given that n docs are retrieved

– Compute the precision (at rank) where each (new) relevant document is retrieved => p(1),…,p(k), if we have k rel. docs

– E.g., if the first rel. doc is at the 2nd rank, then p(1)=1/2.

– If a relevant document never gets retrieved, we assume the precision corresponding to that rel. doc to be zero

• Compute the average over all the relevant documents

– Average precision = (p(1)+…+p(k))/k

• This gives us (non-interpolated) average precision, which captures both precision and recall and is sensitive to the rank of each relevant document

• Mean Average Precision (MAP)

– MAP = arithmetic mean of average precision over a set of topics

– gMAP = geometric mean average precision over a set of topics (more affected by difficult topics)
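
The definitions above can be sketched directly. The example ranking (six returned documents, four relevant documents in total, one never retrieved) and the small floor `eps` used to keep the geometric mean defined when a topic has zero average precision are assumptions made for illustration.

```python
import math


def average_precision(ranking, num_relevant):
    """ranking: booleans in rank order, True where the document is relevant.
    Relevant documents that are never retrieved contribute a precision of zero."""
    hits, total = 0, 0.0
    for rank, is_relevant in enumerate(ranking, start=1):
        if is_relevant:
            hits += 1
            total += hits / rank  # precision at the rank of this relevant doc
    return total / num_relevant


def mean_ap(aps):
    """MAP: arithmetic mean of per-topic average precision."""
    return sum(aps) / len(aps)


def geometric_mean_ap(aps, eps=1e-5):
    """gMAP: geometric mean; eps is an assumed floor to avoid log(0)."""
    return math.exp(sum(math.log(max(ap, eps)) for ap in aps) / len(aps))


# Ranking: + + - - + -, with 4 relevant documents in the collection.
ap = average_precision([True, True, False, False, True, False], num_relevant=4)
```

The example evaluates to (1/1 + 2/2 + 3/5 + 0)/4 = 0.65. Because gMAP multiplies per-topic values, one poorly served topic pulls it down much more than it pulls down MAP.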

Page 21

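Precision-Recall Curve

[Figure: precision-recall curve from an evaluation report, annotated with the measures below]

• Mean Avg. Precision (MAP)

• Recall = 3212/4728: out of 4728 rel docs, we’ve got 3212

• Breakeven Point (prec = recall)

• Precision@10docs: about 5.5 docs in the top 10 docs are relevant

• Average precision example: the system returns 6 docs (D1 +, D2 +, D3 –, D4 –, D5 +, D6 –) and the total # rel docs = 4, so Average Prec = (1/1+2/2+3/5+0)/4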

Page 22

Typical TR System Architecture

[Figure: block diagram of a typical TR system with the components User, query, docs, results, Tokenizer, Indexer, Index, Doc Rep (Index), Query Rep, Scorer, judgments, and Feedback]

Page 23

Tokenization

• Normalize lexical units: Words with similar meanings should be mapped to the same indexing term

• Stemming: Mapping all inflectional forms of words to the same root form, e.g.

– computer -> compute

– computation -> compute

– computing -> compute (but king->k?)

• Porter’s Stemmer is popular for English
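
To make the over-stemming risk concrete, here is a deliberately naive suffix stripper. It is not Porter's algorithm: the suffix list and the `naive_stem` name are assumptions chosen only to reproduce the examples above, including the unwanted "king" case.

```python
# Suffixes are tried in this order; the list is a toy assumption.
SUFFIXES = ("ation", "ing", "er", "e")


def naive_stem(word):
    """Strip the first matching suffix, with no check on what remains."""
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word


stems = {w: naive_stem(w) for w in ["computer", "computation", "computing", "king"]}
```

All three "comput-" forms collapse to the same indexing term, but "king" is reduced to "k", which is why practical stemmers add conditions on the length and shape of the remaining stem.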

Page 24

Relevance Feedback

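[Figure: relevance feedback loop. The Query goes to the Retrieval Engine, which searches the Document collection and returns Results (d1 3.5, d2 2.4, …, dk 0.5, ...). The User supplies Judgments (d1 +, d2 -, d3 +, …, dk -, ...), and Feedback turns them into an Updated query.]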

Page 25

Pseudo/Blind/Automatic Feedback

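[Figure: pseudo feedback loop. The Query goes to the Retrieval Engine, which searches the Document collection and returns Results (d1 3.5, d2 2.4, …, dk 0.5, ...). The top 10 results are assumed relevant, giving Judgments (d1 +, d2 +, d3 +, …, dk -, ...), and Feedback turns them into an Updated query.]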

Page 26

Traditional approach = Vector space model

Page 27

Vector Space Model

• Represent a doc/query by a term vector

– Term: basic concept, e.g., word or phrase

– Each term defines one dimension

– N terms define a high-dimensional space

– Element of vector corresponds to term weight

– E.g., d=(x1,…,xN), xi is “importance” of term i

• Measure relevance by the distance between the query vector and document vector in the vector space
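
A common way to compare a query vector with a document vector is cosine similarity. The slide only says "distance", so treating cosine as the measure is an assumption, and the term weights below are invented.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse term vectors ({term: weight})."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Toy vectors over the dimensions Java, Microsoft, Starbucks.
query = {"java": 1.0, "starbucks": 1.0}
doc_a = {"java": 2.0, "starbucks": 1.0}
doc_b = {"java": 1.0, "microsoft": 3.0}

sim_a = cosine(query, doc_a)
sim_b = cosine(query, doc_b)
```

Under these weights `doc_a` points in nearly the same direction as the query and would be ranked above `doc_b`.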

Page 28

VS Model: illustration

[Figure: documents D1 to D11 and the Query drawn as vectors in a space whose axes are the terms Java, Microsoft, and Starbucks; D1, D2, and D3 are marked with question marks]

Page 29

What’s a good “basic concept”?

• Orthogonal

– Linearly independent basis vectors

– “Non-overlapping” in meaning

• No ambiguity

• Weights can be assigned automatically and hopefully accurately

• Many possibilities: Words, stemmed words, phrases, “latent concept”, …

Page 30

How to Assign Weights?

• Very important!

• Why weighting?

– Query side: Not all terms are equally important

– Doc side: Some terms carry more content

• How?

– Two basic heuristics

• TF (Term Frequency) = Within-doc-frequency

• IDF (Inverse Document Frequency)

– TF normalization
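
The two heuristics combine into the familiar TF-IDF weight. The slide gives no formula, so the specific choice below (raw term count times log(N/df)) is an assumption, and TF normalization is left out to keep the sketch short.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document,
    using raw TF and IDF = log(N / df). This weighting variant is an assumption."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))  # document frequency
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # within-document frequency
        vectors.append({term: tf[term] * math.log(n / df[term]) for term in tf})
    return vectors


# Toy collection: "gene" occurs in every document, "atf2" in only one.
docs = [["gene", "atf2", "gene"], ["gene", "disease"], ["gene", "protein"]]
vectors = tf_idf_vectors(docs)
```

A term that appears in every document gets IDF log(1) = 0 and so carries no weight, while the rare term "atf2" dominates the first document's vector.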

Page 31

Language Modeling Approaches are becoming more and more popular…

Page 32

What is a Statistical LM?

• A probability distribution over word sequences

– p(“Today is Wednesday”) ≈ 0.001

– p(“Today Wednesday is”) ≈ 0.0000000000001

– p(“The eigenvalue is positive”) ≈ 0.00001

• Context-dependent!

• Can also be regarded as a probabilistic mechanism for “generating” text, thus also called a “generative” model

Page 33

The Simplest Language Model (Unigram Model)

• Generate a piece of text by generating each word INDEPENDENTLY

• Thus, p(w1 w2 ... wn)=p(w1)p(w2)…p(wn)

• Parameters: {p(wi)} p(w1)+…+p(wN)=1 (N is voc. size)

• Essentially a multinomial distribution over words

• A piece of text can be regarded as a sample drawn according to this word distribution
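
A small sketch of the independence assumption: the probability of a text is the product of its word probabilities, computed here in log space. The three-word model is a toy assumption.

```python
import math


def unigram_log_prob(words, model):
    """log p(w1 ... wn) = sum of log p(wi) under a unigram model."""
    return sum(math.log(model[w]) for w in words)


# Toy model over a three-word vocabulary; probabilities sum to 1.
model = {"today": 0.2, "is": 0.5, "wednesday": 0.3}

log_p_a = unigram_log_prob(["today", "is", "wednesday"], model)
log_p_b = unigram_log_prob(["today", "wednesday", "is"], model)
```

Both orderings receive the same probability (0.2 × 0.5 × 0.3 = 0.03), which shows that a unigram model cannot capture the word-order preference a general language model would express.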

Page 34

Text Generation with Unigram LM

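[Figure: a (unigram) language model p(w|θ) generates a document by sampling words]

Topic 1 (text mining): … text 0.2, mining 0.1, association 0.01, clustering 0.02, … food 0.00001. Sampling from this model yields a text mining paper.

Topic 2 (health): … food 0.25, nutrition 0.1, healthy 0.05, diet 0.02. Sampling from this model yields a food nutrition paper.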

Page 35

Estimation of Unigram LM

[Figure: estimating a (unigram) language model p(w|θ) = ? from an observed document]

Document: a “text mining paper” (total #words = 100) with counts text 10, mining 5, association 3, database 3, algorithm 2, …, query 1, efficient 1

Estimation: p(text|θ) = 10/100, p(mining|θ) = 5/100, p(association|θ) = 3/100, p(database|θ) = 3/100, …, p(query|θ) = 1/100
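
The estimate illustrated here is the maximum likelihood estimate: each word's count divided by the document length. In the sketch below, the filler token `"other"` is a hypothetical stand-in for the 75 words the slide elides, added only so that the document has 100 words.

```python
from collections import Counter


def mle_unigram(tokens):
    """Maximum likelihood unigram model: p(w) = c(w, d) / |d|."""
    counts = Counter(tokens)
    total = len(tokens)
    return {word: count / total for word, count in counts.items()}


# Counts from the slide, padded with a hypothetical filler token to 100 words.
document = (
    ["text"] * 10 + ["mining"] * 5 + ["association"] * 3 + ["database"] * 3
    + ["algorithm"] * 2 + ["query"] + ["efficient"] + ["other"] * 75
)
model = mle_unigram(document)
```

Any word that does not occur in the document gets probability zero under this estimate, which is the motivation for smoothing.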

Page 36

Language Models for Retrieval (Ponte & Croft 98)

[Figure: each document is associated with its own language model]

Text mining paper → language model: … text ?, mining ?, association ?, clustering ?, … food ?

Food nutrition paper → language model: … food ?, nutrition ?, healthy ?, diet ?

Query = “data mining algorithms”

Which model would most likely have generated this query?

Page 37

Ranking Docs by Query Likelihood

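[Figure: each document d1, d2, …, dN has its own doc LM θd1, θd2, …, θdN; the query q is scored against each model, giving the query likelihoods p(q|θd1), p(q|θd2), …, p(q|θdN) used for ranking]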
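
A minimal sketch of query-likelihood ranking follows. To avoid zero probabilities for query words missing from a document, it mixes the document model with a collection model using a fixed weight `lam`; that smoothing choice, the weight 0.5, and the two toy documents are all assumptions for illustration.

```python
import math
from collections import Counter


def query_log_likelihood(query, doc, collection_lm, lam=0.5):
    """log p(q | theta_d) with a linearly smoothed document model (assumed)."""
    counts = Counter(doc)
    score = 0.0
    for word in query:
        p_ml = counts[word] / len(doc)
        # 1e-9 is an assumed floor for words unseen in the whole collection.
        p = (1 - lam) * p_ml + lam * collection_lm.get(word, 1e-9)
        score += math.log(p)
    return score


docs = {
    "d1": "text mining algorithms for data mining".split(),
    "d2": "food nutrition and healthy diet".split(),
}
all_tokens = [w for tokens in docs.values() for w in tokens]
collection_lm = {w: c / len(all_tokens) for w, c in Counter(all_tokens).items()}

query = "data mining algorithms".split()
ranking = sorted(
    docs,
    key=lambda d: query_log_likelihood(query, docs[d], collection_lm),
    reverse=True,
)
```

The text mining document is ranked first because its model assigns a much higher probability to every query word.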

Page 38

Kullback-Leibler (KL) Divergence Retrieval Model

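• Unigram similarity model

$$
\mathrm{Sim}(d;q) = -D(\hat\theta_Q \,\|\, \hat\theta_D)
= \sum_{w} p(w|\hat\theta_Q)\log p(w|\hat\theta_D)
+ \Big(-\sum_{w} p(w|\hat\theta_Q)\log p(w|\hat\theta_Q)\Big)
$$

The second term is the query entropy (ignored for ranking).

• Retrieval ⇒ Estimation of θQ and θD

$$
\mathrm{sim}(q;d) \stackrel{\mathrm{rank}}{=}
\sum_{\substack{w_i \in d \\ p(w_i|\hat\theta_Q) > 0}}
\Big[\, p(w_i|\hat\theta_Q)\,\log\frac{p_{\mathrm{seen}}(w_i|d)}{\alpha_d\, p(w_i|C)} \Big]
+ \log \alpha_d
$$

• Special case: θ̂Q = empirical distribution of q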

Page 39

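Estimating p(w|d) (i.e., θD)

• Simplified Jelinek-Mercer: Shrink uniformly toward p(w|C)

$$
p(w|d) = (1-\lambda)\, p_{ml}(w|d) + \lambda\, p(w|C)
$$

• Dirichlet prior (Bayesian): Assume pseudo counts μ p(w|C)

$$
p(w|d) = \frac{c(w;d) + \mu\, p(w|C)}{|d| + \mu}
= \frac{|d|}{|d|+\mu}\, p_{ml}(w|d) + \frac{\mu}{|d|+\mu}\, p(w|C)
$$

• Absolute discounting: Subtract a constant δ

$$
p(w|d) = \frac{\max(c(w;d) - \delta,\, 0) + \delta\, |d|_u\, p(w|C)}{|d|}
$$

Here |d| is the document length and |d|_u is the number of unique terms in d.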
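
The three smoothing estimators can be written as small functions. The default parameter values (`lam`, `mu`, `delta`) and the three-word toy vocabulary are assumptions; the point of the example is that each estimator still sums to one over the vocabulary while giving unseen words nonzero probability.

```python
def jelinek_mercer(c_wd, doc_len, p_wc, lam=0.1):
    """(1 - lam) * p_ml(w|d) + lam * p(w|C)."""
    return (1 - lam) * c_wd / doc_len + lam * p_wc


def dirichlet_prior(c_wd, doc_len, p_wc, mu=2000.0):
    """(c(w;d) + mu * p(w|C)) / (|d| + mu)."""
    return (c_wd + mu * p_wc) / (doc_len + mu)


def absolute_discount(c_wd, doc_len, unique_terms, p_wc, delta=0.7):
    """(max(c(w;d) - delta, 0) + delta * |d|_u * p(w|C)) / |d|."""
    return (max(c_wd - delta, 0.0) + delta * unique_terms * p_wc) / doc_len


# Toy document: |d| = 4 with 2 unique terms; "c" is unseen in the document.
counts = {"a": 3, "b": 1, "c": 0}
p_c = {"a": 0.5, "b": 0.3, "c": 0.2}  # toy collection model
doc_len, unique = 4, 2

jm = {w: jelinek_mercer(counts[w], doc_len, p_c[w]) for w in counts}
dp = {w: dirichlet_prior(counts[w], doc_len, p_c[w]) for w in counts}
ad = {w: absolute_discount(counts[w], doc_len, unique, p_c[w]) for w in counts}
```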

Page 40

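Estimating θQ (Feedback)

[Figure: feedback in the KL-divergence retrieval model. Query Q gives the query model θQ and Document D gives the document model θD; documents are scored by D(θQ || θD) to produce Results. The feedback docs F = {d1, d2, …, dn} give a feedback model θF through a generative model, and θF is interpolated with θQ.]

$$
\theta_{Q'} = (1-\alpha)\,\theta_Q + \alpha\,\theta_F
$$

• α = 0: no feedback, θQ’ = θQ

• α = 1: full feedback, θQ’ = θF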
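
The update of the query model is a linear interpolation of two word distributions, sketched below. The two toy models and the helper name `interpolate` are assumptions for illustration.

```python
def interpolate(theta_q, theta_f, alpha):
    """Updated query model: (1 - alpha) * theta_Q + alpha * theta_F."""
    vocab = set(theta_q) | set(theta_f)
    return {
        w: (1 - alpha) * theta_q.get(w, 0.0) + alpha * theta_f.get(w, 0.0)
        for w in vocab
    }


theta_q = {"airport": 0.5, "security": 0.5}  # toy original query model
theta_f = {"security": 0.4, "bomb": 0.6}     # toy feedback model

updated = interpolate(theta_q, theta_f, alpha=0.5)
```

With `alpha=0.5` the updated model keeps the original query words but also gives weight to "bomb", a term that appears only in the feedback model.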

Page 41

Generative Mixture Model

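[Figure: each word w in the feedback documents F = {d1, …, dn} is generated from one of two sources. With probability P(source) = 1-λ it is a topic word drawn from P(w|θ); with probability P(source) = λ it is a background word drawn from P(w|C). λ = noise in feedback documents.]

$$
\log p(F|\theta) = \sum_{i}\sum_{w} c(w;d_i)\,\log\big[(1-\lambda)\,p(w|\theta) + \lambda\,p(w|C)\big]
$$

Maximum likelihood:

$$
\theta_F = \arg\max_{\theta}\, \log p(F|\theta)
$$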

Page 42

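How to Estimate θF?

[Figure: the observed doc(s) are generated by a mixture of a known background model and an unknown query topic model for “Text mining”]

Known background p(w|C), with weight λ = 0.7: the 0.2, a 0.1, we 0.01, to 0.02, …, text 0.0001, mining 0.00005

Unknown query topic p(w|θF) = ?, with weight 1-λ = 0.3: … text = ?, mining = ?, association = ?, word = ?

Suppose we know the identity of each word ... Then an ML estimator gives p(w|θF).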

Page 43

Can We Guess the Identity?

Identity (“hidden”) variable: zi ∈ {1 (background), 0 (topic)}

Word:  the  paper  presents  a  text  mining  algorithm  the  paper  ...

zi:    1    1      1         1  0     0       0          1    0      ...

Suppose the parameters are all known, what’s a reasonable guess of zi?

- depends on λ (why?)

- depends on p(w|C) and p(w|θF) (how?)

E-step:

$$
p(z_i = 1 \mid w_i)
= \frac{p(z_i=1)\,p(w_i \mid z_i=1)}{p(z_i=1)\,p(w_i \mid z_i=1) + p(z_i=0)\,p(w_i \mid z_i=0)}
= \frac{\lambda\, p(w_i|C)}{\lambda\, p(w_i|C) + (1-\lambda)\, p(w_i|\theta_F)}
$$

M-step:

$$
p^{new}(w_i|\theta_F)
= \frac{c(w_i, F)\,\big(1 - p(z_i=1 \mid w_i)\big)}{\sum_{w_j} c(w_j, F)\,\big(1 - p(z_j=1 \mid w_j)\big)}
$$

Initially, set p(w|θF) to some random value, then iterate …
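
The E-step and M-step can be turned into a short loop. This is a sketch under stated assumptions: it starts from a uniform topic model instead of a random one, runs a fixed number of iterations, and uses invented word counts; the background probabilities for "the", "text", and "mining" follow the earlier slide, while the value for "algorithm" is made up.

```python
def estimate_feedback_model(counts, p_background, lam=0.7, iterations=50):
    """EM for the two-component mixture.
    counts: {word: c(w, F)}; p_background: {word: p(w|C)}; lam: background weight."""
    vocab = list(counts)
    p_topic = {w: 1.0 / len(vocab) for w in vocab}  # uniform start (assumption)
    for _ in range(iterations):
        # E-step: probability that each word was generated by the background.
        z = {
            w: lam * p_background[w]
            / (lam * p_background[w] + (1 - lam) * p_topic[w])
            for w in vocab
        }
        # M-step: re-estimate the topic model from counts credited to the topic.
        weighted = {w: counts[w] * (1 - z[w]) for w in vocab}
        total = sum(weighted.values())
        p_topic = {w: weighted[w] / total for w in vocab}
    return p_topic


counts = {"the": 4, "text": 10, "mining": 8, "algorithm": 6}  # toy c(w, F)
p_background = {"the": 0.2, "text": 0.0001, "mining": 0.00005, "algorithm": 0.0001}

theta_f = estimate_feedback_model(counts, p_background)
```

Because the background model already explains the occurrences of "the", its probability under the topic model shrinks toward zero, leaving the mass on the content words.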

Page 44

Example of Feedback Query Model

TREC topic 412: “airport security”. Mixture model approach, Web database, top 10 docs.

λ = 0.9                          λ = 0.7

W                p(W|θF)         W            p(W|θF)

security         0.0558          the          0.0405

airport          0.0546          security     0.0377

beverage         0.0488          airport      0.0342

alcohol          0.0474          beverage     0.0305

bomb             0.0236          alcohol      0.0304

terrorist        0.0217          to           0.0268

author           0.0206          of           0.0241

license          0.0188          and          0.0214

bond             0.0186          author       0.0156

counter-terror   0.0173          bomb         0.0150

terror           0.0142          terrorist    0.0137

newsnet          0.0129          in           0.0135

attack           0.0124          license      0.0127

operation        0.0121          state        0.0127

headline         0.0121          by           0.0125

Page 45

Problem with Standard IR Methods: Semi-Structured Queries

• TREC-2003 Genomics Track, Topic 1:

Find articles about the following gene:

OFFICIAL_GENE_NAME activating transcription factor 2

OFFICIAL_SYMBOL ATF2

ALIAS_SYMBOL HB16

ALIAS_SYMBOL CREB2

ALIAS_SYMBOL TREB7

ALIAS_SYMBOL CRE-BP1

Bag-of-word Representation: activating transcription factor 2, ATF2, HB16, CREB2, TREB7, CRE-BP1

• Problems with unstructured representation

– Intuitively, matching “ATF2” should be counted more than matching “transcription”

– Such a query is not a natural sample of a unigram language model, violating the assumption of the language modeling retrieval approach

Page 46

Problem with Standard IR Methods: Semi-Structured Queries (cont.)

• A topic in TREC-2005 Genomics Track:

Find information about the role of the gene interferon-beta in the disease multiple sclerosis

• 3 different fields

• Should be weighted differently?

• What about expansion?

Page 47

Semi-Structured Language Models

Semi-structured query: Q = (Q1, ..., Qk)

Semi-structured query model, with one component model θ1, ..., θk per field:

$$
p(w|\theta_Q) = \sum_{i=1}^{k} \lambda_i\, p(w|\theta_i)
$$

Semi-structured LM estimation: Fit a mixture model to pseudo feedback documents using Expectation-Maximization (EM)

Page 48

Parameter Estimation

• Synonym queries:

– Each field is estimated using maximum likelihood:

$$
p(w|\theta_i) = \frac{c(w, Q_i)}{|Q_i|}
$$

– Each field has equal weights: λi = 1/k

• Aspect queries:

– Use top-ranked documents to estimate all the parameters

– Similar to single-aspect model, but use query as prior and Bayesian estimation
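
For synonym queries, the estimation can be sketched as follows: each field gets a maximum likelihood model, and the fields are mixed with equal weights. Grouping the TREC-2003 topic into three fields (gene name, official symbol, alias symbols) is an assumption made for this example, as are the helper names.

```python
from collections import Counter


def field_model(tokens):
    """Maximum likelihood model of one field: p(w|theta_i) = c(w, Q_i) / |Q_i|."""
    counts = Counter(tokens)
    return {w: c / len(tokens) for w, c in counts.items()}


def semi_structured_query_model(fields, weights=None):
    """p(w|theta_Q) = sum_i lambda_i * p(w|theta_i); uniform weights by default."""
    k = len(fields)
    weights = weights or [1.0 / k] * k
    model = {}
    for lam, tokens in zip(weights, fields):
        for w, p in field_model(tokens).items():
            model[w] = model.get(w, 0.0) + lam * p
    return model


# Assumed field grouping: name, official symbol, alias symbols.
fields = [
    ["activating", "transcription", "factor", "2"],
    ["ATF2"],
    ["HB16", "CREB2", "TREB7", "CRE-BP1"],
]
query_model = semi_structured_query_model(fields)
```

Under this grouping "ATF2" receives probability 1/3 while "transcription" receives 1/12, whereas a flat bag of words would give every term the same weight of 1/9.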

Page 49

Maximum Likelihood vs. Bayesian

• Maximum likelihood estimation

– “Best” means “data likelihood reaches maximum”

$$
\hat\theta = \arg\max_{\theta} P(X|\theta)
$$

– Problem: small sample

• Bayesian estimation

– “Best” means being consistent with our “prior” knowledge and explaining data well

$$
\hat\theta = \arg\max_{\theta} P(\theta|X) = \arg\max_{\theta} P(X|\theta)\,P(\theta)
$$

– Problem: how to define prior?

Page 50

Illustration of Bayesian Estimation

[Figure: prior, likelihood, and posterior densities plotted over θ for data X = (x1,…,xN)]

Prior: p(θ)

Likelihood: p(X|θ)

Posterior: p(θ|X) ∝ p(X|θ)p(θ)

θ0: prior mode; θml: ML estimate; θ: posterior mode
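
A coin-flip example makes the relationship between the three modes concrete. The Bernoulli likelihood with a Beta(a, b) prior is an assumption chosen because its posterior mode has a closed form; it is not the model used in the talk, and the numbers are invented.

```python
def ml_estimate(heads, n):
    """Maximum likelihood estimate of the success probability."""
    return heads / n


def prior_mode(a, b):
    """Mode of a Beta(a, b) prior; requires a > 1 and b > 1."""
    return (a - 1) / (a + b - 2)


def posterior_mode(heads, n, a, b):
    """Mode of the Beta(a + heads, b + n - heads) posterior."""
    return (heads + a - 1) / (n + a + b - 2)


# Small sample: 9 successes in 10 trials, with a prior centered at 0.5.
theta_ml = ml_estimate(9, 10)
theta_0 = prior_mode(5, 5)
theta_post = posterior_mode(9, 10, 5, 5)
```

The posterior mode (13/18, about 0.72) falls between the prior mode (0.5) and the ML estimate (0.9), and it moves toward the ML estimate as more data is observed.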

Page 51

Experiment Results

               TREC 2003 (Uniform weights)          TREC 2005 (Estimated weights)

Query Model    Unstruct   Semi-struct   Imp.        Unstruct   Semi-struct   Imp.

MAP            0.16       0.185         +13.5%      0.242      0.258         +6.6%

Pr@10docs      0.14       0.154         +10%        0.382      0.412         +7.8%

Page 52

More Experiment Results (with slightly different model)

Page 53

Conclusions

• Standard IR techniques are effective for biomedical literature retrieval

• Modeling and exploiting the structure in a query can improve accuracy

• Overall TREC Genomics Track findings

– Domain-specific resources are very useful

– Sound retrieval models and machine learning techniques are helpful

Page 54

Future Work

• Using HMMs to model relevant documents

• Incorporating biomedical resources into principled statistical models