Introduction to Biomedical Informatics: Information Retrieval


Page 1: Introduction to Biomedical Informatics Information  Retrieval

© Hayes/Smyth: Introduction to Biomedical Informatics: 1

Introduction to Biomedical Informatics

Information Retrieval

Page 2: Introduction to Biomedical Informatics Information  Retrieval


Outline

• Introduction: basic concepts

• Document-query matching methods
  – TF-IDF
  – Latent semantic indexing

• Evaluation methods

• Web-scale retrieval

• Example: PubMed

• Additional Resources and Recommended Reading

Page 3: Introduction to Biomedical Informatics Information  Retrieval


Information Retrieval

• Generic task:
  – User has a query Q expressed in some way (e.g., a set of keywords)
  – The user would like to find the documents from some corpus D that are most relevant to Q

• Information retrieval is the problem of automatically finding and ranking the most relevant documents in a corpus D, given a query Q

• Examples:
  – Q = {lung cancer, smoking}, D = 20 million papers in PubMed
  – Q = {pizza, irvine}, D = all documents on the Web

Page 4: Introduction to Biomedical Informatics Information  Retrieval


General Issues in Document Querying

– What representation language to use for docs and queries

– How to measure similarity between Q and each document in D

– How to rank the results for the user

– Allowing user feedback (query modification)

– How to evaluate and compare different IR algorithms/systems

– How to compute the results in real-time (for interactive querying)

Page 5: Introduction to Biomedical Informatics Information  Retrieval


General Concepts

• Corpus D consisting of N documents
  – Typically represented as an N x d matrix
  – Each document represented as a vector of d terms
    • E.g., entry i, j is the number of times term j occurs in document i

• Query Q:
  – User poses a query to search D
  – Query is typically expressed as a vector of d terms
    • Query Q is expressed as a set of words, e.g., “data” and “mining” are both set to 1 and all other terms are set to 0 (so we can think of the query Q as a “pseudo-document”)

• Key ideas:
  – Represent both documents and queries as vectors in some term-space
  – Matching a query with documents => defining a vector similarity measure

Page 6: Introduction to Biomedical Informatics Information  Retrieval


Querying Approaches

• Exact-match query: return a list of all exact matches
  – Boolean match on text
    • query = “Irvine” AND “fun”: return all docs containing both “Irvine” and “fun”
  – Can generalize to Boolean functions
    • e.g., NOT (“Irvine” OR “Newport Beach”) AND “fun”
  – Not so useful when there are many matches
    • E.g., “data mining” in Google returns millions of documents

• Ranked queries: return a ranked list of the most relevant matches
  – e.g., which document is most similar to a query Q?
  – Q could itself be a full document or a shorter version (e.g., one or a few words); we will focus here on short (few-word) queries

• Typical two-stage approach (e.g., in commercial search engines)
  – First use exact match to retrieve an initial set of documents
  – Then use more sophisticated similarity measures to rank the documents in this set based on how relevant they are to the query Q
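The two-stage approach above can be sketched in a few lines. Stage 1 is the Boolean exact match; the corpus, function names, and query here are illustrative toy examples, not any real system's API.

```python
# Minimal sketch of stage 1: exact Boolean (AND) matching over a toy corpus.
# All names and documents here are illustrative.

docs = {
    1: "irvine is fun and sunny",
    2: "newport beach is near irvine",
    3: "data mining is fun",
}

def matches(doc_text, required_terms):
    """Return True if the document contains every required term (Boolean AND)."""
    words = set(doc_text.split())
    return all(term in words for term in required_terms)

# query = "Irvine" AND "fun"
candidates = [doc_id for doc_id, text in docs.items() if matches(text, ["irvine", "fun"])]
print(candidates)  # only doc 1 contains both terms
```

Stage 2 would then rank only these candidates with a similarity measure such as cosine distance.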

Page 7: Introduction to Biomedical Informatics Information  Retrieval


“Bag of Words” Representation for Documents

• document: book, paper, WWW page, ...
• term: word, word-pair, phrase, … (could be millions)
• query Q = set of terms, e.g., “data” + “mining”
• Full NLP (natural language processing) is too hard, so …
  – we want a (vector) representation for text which
  – retains maximum useful semantics
  – supports efficient distance computation between docs and Q

• “bag of words” ignores word order, sentence structure, etc.
  – Nonetheless works well in practice and is widely used
  – Very computationally efficient compared to dealing with word order

• term values
  – Boolean (e.g., term in document or not): “bag of words”
  – real-valued (e.g., frequency of term in doc; relative to all docs) ...

Page 8: Introduction to Biomedical Informatics Information  Retrieval


Example of a document-term (bag-of-words) matrix

     database  SQL  index  regression  likelihood  linear
d1      24      21     9        0          0         3
d2      32      10     5        0          3         0
d3      12      16     5        0          0         0
d4       6       7     2        0          0         0
d5      43      31    20        0          3         0
d6       2       0     0       18          7        16
d7       0       0     1       32         12         0
d8       3       0     0       22          4         2
d9       1       0     0       34         27        25
d10      6       0     0       17          4        23

Page 9: Introduction to Biomedical Informatics Information  Retrieval


Practical Issues

• Tokenization
  – Convert document to a list of word counts
  – word token = “any nonempty sequence of characters”
  – challenges: punctuation, equations, HTML, formatting, etc.
    • Special parsers to handle these issues

• Canonical forms, Stopwords, Stemming
  – Typically remove capitalization
    • But capitalization may be important for proper nouns, e.g., the name “Gene” vs. “gene”
  – Stopwords:
    • remove very frequent words (a, the, and, …); can use a standard list
    • Can also remove very rare words
  – Stemming (next slide)
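A minimal tokenization pipeline covering the steps above (lowercasing, crude punctuation handling, stopword removal). The regex and the tiny stopword list are simplified assumptions for illustration:

```python
# Illustrative tokenization pipeline: lowercase, strip punctuation,
# drop stopwords, and count the remaining word tokens.
import re
from collections import Counter

STOPWORDS = {"a", "an", "the", "and", "of", "is", "in"}  # a standard list is much longer

def tokenize(text):
    text = text.lower()                      # canonical form (loses "Gene" vs. "gene")
    tokens = re.findall(r"[a-z0-9]+", text)  # crude handling of punctuation/formatting
    return [t for t in tokens if t not in STOPWORDS]

counts = Counter(tokenize("The gene, and the Gene: a study of genes."))
print(counts)  # Counter({'gene': 2, 'study': 1, 'genes': 1})
```

Note that without stemming, "gene" and "genes" remain distinct index terms.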

Page 10: Introduction to Biomedical Informatics Information  Retrieval


Stemming

• Want to reduce all morphological variants of a word to a single index term
  – e.g., a document containing words like fish and fisher may not be retrieved by a query containing fishing (fishing is not explicitly contained in the document)

• Stemming: reduce words to their root form
  – e.g., fish becomes a new index term

• Porter stemming algorithm (1980)
  – relies on a preconstructed suffix list with associated rules
    • e.g., if suffix = IZATION and the prefix contains at least one vowel followed by a consonant, replace with suffix = IZE
      – BINARIZATION => BINARIZE
  – Not always desirable: e.g., {university, universal} -> univers (in Porter’s)

• Alternatives include WordNet, which is a dictionary-based approach
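The IZATION -> IZE rule above can be sketched as follows. This is a single illustrative rule, not the full Porter algorithm (which has many rules and a more precise "measure" condition):

```python
# One Porter-style suffix rule (IZATION -> IZE), as described on the slide.
# A real stemmer applies many such rules in ordered passes.

VOWELS = set("aeiou")

def has_vowel_then_consonant(prefix):
    """True if the prefix contains a vowel followed (eventually) by a consonant."""
    seen_vowel = False
    for ch in prefix:
        if ch in VOWELS:
            seen_vowel = True
        elif seen_vowel:
            return True
    return False

def apply_ization_rule(word):
    if word.endswith("ization"):
        prefix = word[:-len("ization")]
        if has_vowel_then_consonant(prefix):
            return prefix + "ize"
    return word

print(apply_ization_rule("binarization"))  # -> binarize
```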

Page 11: Introduction to Biomedical Informatics Information  Retrieval


Google n-gram Data

EXAMPLE OF TRIGRAMS:

ceramics collectables collectibles    55
ceramics collectables fine           130
ceramics collected by                 52
ceramics collectible pottery          50
ceramics collectibles cooking         45
ceramics collection ,                144
ceramics collection .                247
ceramics collection </S>             120
ceramics collection and               43

Number of tokens      1 trillion
Number of sentences   95 billion
Number of unigrams    13 million
Number of bigrams     314 million
Number of trigrams    977 million
Number of fourgrams   1.3 billion
Number of fivegrams   1.2 billion

Data made available for research by Google in 2006

Page 12: Introduction to Biomedical Informatics Information  Retrieval


Document Similarity

• Measuring similarity between two document term vectors x and y:
  – wide variety of distance metrics:
    • Euclidean (L2) = sqrt(Σ_i (x_i - y_i)²)
    • L1 = Σ_i |x_i - y_i|
    • ...
    • weighted L2 = sqrt(Σ_i (w_i x_i - w_i y_i)²)

• Cosine distance between docs:

    cos(x, y) = xᵀy / (‖x‖ ‖y‖)

  – often gives better results than Euclidean
    • normalizes relative to document length
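A sketch of the cosine computation on rows of the toy doc-term matrix; in practice one would use numpy or scipy, but the formula is just the dot product over the product of norms:

```python
# Cosine similarity between two term vectors: cos(x, y) = x.y / (|x| |y|).
import math

def cosine(x, y):
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm_x = math.sqrt(sum(xi * xi for xi in x))
    norm_y = math.sqrt(sum(yi * yi for yi in y))
    return dot / (norm_x * norm_y)

d1 = [24, 21, 9, 0, 0, 3]   # row d1 of the toy doc-term matrix ("database" topic)
d6 = [2, 0, 0, 18, 7, 16]   # row d6 ("regression" topic)

print(round(cosine(d1, d1), 2))  # 1.0: identical direction, regardless of length
print(round(cosine(d1, d6), 2))  # small value: the two docs are about different topics
```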

Page 13: Introduction to Biomedical Informatics Information  Retrieval


Distance matrices for toy document-term data

TF doc-term matrix

     t1  t2  t3  t4  t5  t6
d1   24  21   9   0   0   3
d2   32  10   5   0   3   0
d3   12  16   5   0   0   0
d4    6   7   2   0   0   0
d5   43  31  20   0   3   0
d6    2   0   0  18   7  16
d7    0   0   1  32  12   0
d8    3   0   0  22   4   2
d9    1   0   0  34  27  25
d10   6   0   0  17   4  23

[Euclidean and cosine distance matrices shown as figures in the original slide]

Page 14: Introduction to Biomedical Informatics Information  Retrieval


TF-IDF Term Weighting Schemes

• Not all terms in a query or document may be equally important...

• TF (term frequency): term weight = number of times the term occurs in that document
  – problem: a term common to many docs => low discrimination, e.g., “medical”

• IDF (inverse document frequency of a term)
  – n_j documents contain term j, N documents in total
  – IDF = log(N/n_j)
  – Favors terms that occur in relatively few documents

• TF-IDF: TF(term)*IDF(term)

• No real theoretical basis, but works very well empirically and widely used

Page 15: Introduction to Biomedical Informatics Information  Retrieval


TF-IDF Example

TF doc-term matrix

     t1  t2  t3  t4  t5  t6
d1   24  21   9   0   0   3
d2   32  10   5   0   3   0
d3   12  16   5   0   0   0
d4    6   7   2   0   0   0
d5   43  31  20   0   3   0
d6    2   0   0  18   7  16
d7    0   0   1  32  12   0
d8    3   0   0  22   4   2
d9    1   0   0  34  27  25
d10   6   0   0  17   4  23

IDF weights are (0.1, 0.7, 0.5, 0.7, 0.4, 0.7) (using natural logs, i.e., log base e)

Page 16: Introduction to Biomedical Informatics Information  Retrieval


TF-IDF Example

TF doc-term matrix

     t1  t2  t3  t4  t5  t6
d1   24  21   9   0   0   3
d2   32  10   5   0   3   0
d3   12  16   5   0   0   0
d4    6   7   2   0   0   0
d5   43  31  20   0   3   0
d6    2   0   0  18   7  16
d7    0   0   1  32  12   0
d8    3   0   0  22   4   2
d9    1   0   0  34  27  25
d10   6   0   0  17   4  23

TF-IDF doc-term matrix

      t1    t2    t3   t4   t5   t6
d1   2.5  14.6   4.6    0    0  2.1
d2   3.4   6.9   2.6    0  1.1    0
d3   1.3  11.1   2.6    0    0    0
d4   0.6   4.9   1.0    0    0    0
d5   4.5  21.5  10.2    0  1.1    0
...

Example: TF-IDF for term t1 in doc d1 = TF*IDF = 24 * log(10/9) ≈ 2.5

IDF weights are (0.1, 0.7, 0.5, 0.7, 0.4, 0.7) (using natural logs, i.e., log base e)
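The slide's numbers can be reproduced directly. A sketch using the toy TF matrix and IDF_j = ln(N / n_j), where n_j is the number of documents containing term j:

```python
# Reproducing the slide's TF-IDF computation on the toy doc-term matrix:
# IDF_j = ln(N / n_j), TF-IDF_ij = TF_ij * IDF_j.
import math

tf = [
    [24, 21,  9,  0,  0,  3],
    [32, 10,  5,  0,  3,  0],
    [12, 16,  5,  0,  0,  0],
    [ 6,  7,  2,  0,  0,  0],
    [43, 31, 20,  0,  3,  0],
    [ 2,  0,  0, 18,  7, 16],
    [ 0,  0,  1, 32, 12,  0],
    [ 3,  0,  0, 22,  4,  2],
    [ 1,  0,  0, 34, 27, 25],
    [ 6,  0,  0, 17,  4, 23],
]
N = len(tf)
d = len(tf[0])

# n_j = number of documents containing term j
n = [sum(1 for row in tf if row[j] > 0) for j in range(d)]
idf = [math.log(N / nj) for nj in n]
tfidf = [[row[j] * idf[j] for j in range(d)] for row in tf]

print([round(w, 1) for w in idf])  # [0.1, 0.7, 0.5, 0.7, 0.4, 0.7], as on the slide
print(round(tfidf[0][0], 1))       # d1, t1: 24 * ln(10/9) ≈ 2.5
```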

Page 17: Introduction to Biomedical Informatics Information  Retrieval


Simple Document Querying System

• Queries Q = binary term vectors

• Documents represented by TF-IDF weights

• Cosine distance used for retrieval and ranking

Page 18: Introduction to Biomedical Informatics Information  Retrieval


Example of a document-term (bag-of-words) matrix

     database  SQL  index  regression  likelihood  linear
d1      24      21     9        0          0         3
d2      32      10     5        0          3         0
d3      12      16     5        0          0         0
d4       6       7     2        0          0         0
d5      43      31    20        0          3         0
d6       2       0     0       18          7        16
d7       0       0     1       32         12         0
d8       3       0     0       22          4         2
d9       1       0     0       34         27        25
d10      6       0     0       17          4        23

Page 19: Introduction to Biomedical Informatics Information  Retrieval


Representing a Query as a Pseudo-Document

     database  SQL  index  regression  likelihood  linear
d1      24      21     9        0          0         3
d2      32      10     5        0          3         0
d3      12      16     5        0          0         0
d4       6       7     2        0          0         0
d5      43      31    20        0          3         0
d6       2       0     0       18          7        16
d7       0       0     1       32         12         0
d8       3       0     0       22          4         2
d9       1       0     0       34         27        25
d10      6       0     0       17          4        23

Q        1       0     1        0          0         0

Example query Q = [database index]

Page 20: Introduction to Biomedical Informatics Information  Retrieval


Example: Query Q = [t1 t3]

      TF   TF-IDF
d1   0.70   0.32
d2   0.77   0.51
d3   0.58   0.24
d4   0.60   0.23
d5   0.79   0.43
...

TF doc-term matrix

     t1  t2  t3  t4  t5  t6
d1   24  21   9   0   0   3
d2   32  10   5   0   3   0
d3   12  16   5   0   0   0
d4    6   7   2   0   0   0
d5   43  31  20   0   3   0
d6    2   0   0  18   7  16
d7    0   0   1  32  12   0
d8    3   0   0  22   4   2
d9    1   0   0  34  27  25
d10   6   0   0  17   4  23

Q = (1,0,1,0,0,0)

TF-IDF doc-term matrix

      t1    t2    t3   t4   t5   t6
d1   2.5  14.6   4.6    0    0  2.1
d2   3.4   6.9   2.6    0  1.1    0
d3   1.3  11.1   2.6    0    0    0
d4   0.6   4.9   1.0    0    0    0
d5   4.5  21.5  10.2    0  1.1    0
...
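A sketch reproducing the TF column of the table above: score the query vector Q = (1,0,1,0,0,0) against each TF row with cosine similarity:

```python
# Scoring the query Q = (1,0,1,0,0,0) (terms t1 and t3) against the first
# five rows of the toy TF matrix with cosine similarity.
import math

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

tf = [
    [24, 21,  9,  0,  0,  3],
    [32, 10,  5,  0,  3,  0],
    [12, 16,  5,  0,  0,  0],
    [ 6,  7,  2,  0,  0,  0],
    [43, 31, 20,  0,  3,  0],
]
q = [1, 0, 1, 0, 0, 0]

scores = [round(cosine(q, row), 2) for row in tf]
print(scores)  # [0.7, 0.77, 0.58, 0.6, 0.79], matching the TF column of the table
```

Documents would then be returned in decreasing order of score (here d5 first, then d2).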

Page 21: Introduction to Biomedical Informatics Information  Retrieval


Synonymy and Polysemy

• Synonymy
  – the same concept can be expressed using different sets of terms
    • e.g., bandit, brigand, thief
  – negatively affects recall (i.e., the number of relevant docs returned)

• Polysemy
  – identical terms can be used in very different semantic contexts
    • bank
    • bear left at the zoo
    • time flies like an arrow
  – negatively affects precision (i.e., the relevance of the returned docs)

Page 22: Introduction to Biomedical Informatics Information  Retrieval


Latent Semantic Indexing

• Approximate data in the original d-dimensional space by data in a k-dimensional space, where k << d

• Find the k linear projections of the data that contain the most variance
  – Basic approach is known as principal component analysis or singular value decomposition
  – Also known as “latent semantic indexing” when applied to text

• Captures dependencies among terms
  – In effect represents the original d-dimensional basis with a k-dimensional basis
  – e.g., terms like SQL, indexing, query could be approximated as coming from a single “hidden” term

• Why is this useful?
  – Query contains “automobile”, document contains “vehicle”
  – can still match Q to the document since the two terms will be close in k-space (but not in the original space), i.e., addresses the synonymy problem

Page 23: Introduction to Biomedical Informatics Information  Retrieval


Toy example of a document-term matrix

     database  SQL  index  regression  likelihood  linear
d1      24      21     9        0          0         3
d2      32      10     5        0          3         0
d3      12      16     5        0          0         0
d4       6       7     2        0          0         0
d5      43      31    20        0          3         0
d6       2       0     0       18          7        16
d7       0       0     1       32         12         0
d8       3       0     0       22          4         2
d9       1       0     0       34         27        25
d10      6       0     0       17          4        23

Page 24: Introduction to Biomedical Informatics Information  Retrieval


Singular Value Decomposition (SVD)

• M = U S VT

  – M = n x d = original document-term matrix (the data)

  – U = n x d, each row = vector of weights for each document

  – S = d x d diagonal matrix of singular values

  – Rows of VT (columns of V) = new orthogonal basis for the data

  – Each singular value indicates how much of the variance in the data is captured by the corresponding basis vector

  – Typically select just the first k basis vectors, k << d

(also known as principal components, or LSI (latent semantic indexing))

optional
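A sketch of the truncated SVD on the toy matrix using numpy. Note the sign of each singular vector is arbitrary, so the document coordinates may differ from the slide's by a sign flip:

```python
# LSI via truncated SVD: keep only the first k singular vectors of the
# document-term matrix and work in the resulting k-dimensional space.
import numpy as np

M = np.array([
    [24, 21,  9,  0,  0,  3],
    [32, 10,  5,  0,  3,  0],
    [12, 16,  5,  0,  0,  0],
    [ 6,  7,  2,  0,  0,  0],
    [43, 31, 20,  0,  3,  0],
    [ 2,  0,  0, 18,  7, 16],
    [ 0,  0,  1, 32, 12,  0],
    [ 3,  0,  0, 22,  4,  2],
    [ 1,  0,  0, 34, 27, 25],
    [ 6,  0,  0, 17,  4, 23],
], dtype=float)

U, s, Vt = np.linalg.svd(M, full_matrices=False)

k = 2                                        # keep the first k basis vectors
M_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]  # rank-k approximation of M

# Coordinates of each document in the 2-dimensional "concept" space;
# the first axis separates the database docs from the regression docs.
doc_coords = U[:, :k] * s[:k]
print(doc_coords.round(1))
```

A query vector can be projected into the same k-space (q Vᵀ, suitably scaled) and compared to documents there, which is how LSI addresses synonymy.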

Page 25: Introduction to Biomedical Informatics Information  Retrieval


Example: SVD Applied to Document-Term Matrix

U1 U2

d1 30.9 -11.5

d2 30.3 -10.8

d3 18.0 -7.7

d4 8.4 -3.6

d5 52.7 -20.6

d6 14.2 21.8

d7 10.8

d8 11.5 28.0

d9 9.5 17.8

d10 19.9 45.0

     database  SQL  index  regression  likelihood  linear

d1 24 21 9 0 0 3

d2 32 10 5 0 3 0

d3 12 16 5 0 0 0

d4 6 7 2 0 0 0

d5 43 31 20 0 3 0

d6 2 0 0 18 7 16

d7 0 0 1 32 12 0

d8 3 0 0 22 4 2

d9 1 0 0 34 27 25

d10 6 0 0 17 4 23

Page 26: Introduction to Biomedical Informatics Information  Retrieval


v1 = [0.74, 0.49, 0.27, 0.28, 0.18, 0.19]
v2 = [-0.28, -0.24, -0.12, 0.74, 0.37, 0.31]

New documents:
D1 = database x 50
D2 = SQL x 50

optional

Page 27: Introduction to Biomedical Informatics Information  Retrieval


Evaluating Retrieval Methods

• Typically there is no real ground truth for a query – so how can we evaluate our algorithms? Is A better than B?

• Academic research
  – Use small testbed data sets of documents where human labelers assign a binary label to each document in the corpus, in terms of its relevance to a specific query Q
  – repeat for different queries
  – very time-consuming!

• Real-world (e.g., Web search)
  – Can use click data as a surrogate indicator for relevancy
  – Can generate very large amounts of training/test data per query

• Both approaches are useful for precision, not so useful for recall

Page 28: Introduction to Biomedical Informatics Information  Retrieval


Precision versus Recall

Rank documents (numerically) with respect to the query

Compute precision and recall by thresholding the rankings

precision = fraction of retrieved objects that are relevant

recall = fraction of the relevant objects that are retrieved (retrieved relevant / total relevant)

Tradeoff: high precision -> low recall, and vice-versa

Similar in concept to the receiver operating characteristic (ROC)

For multiple queries, precision for specific ranges of recall can be averaged (so-called “interpolated precision”).
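A sketch of precision and recall at a rank cutoff k, assuming all relevant documents appear somewhere in the ranked list (toy relevance labels, illustrative only):

```python
# Precision and recall at rank k from a ranked list of 0/1 relevance labels.

def precision_recall_at_k(ranked_relevance, k):
    """ranked_relevance: list of 0/1 labels in ranked order (1 = relevant)."""
    retrieved = ranked_relevance[:k]
    total_relevant = sum(ranked_relevance)  # assumes all relevant docs are in the list
    precision = sum(retrieved) / k
    recall = sum(retrieved) / total_relevant
    return precision, recall

# Suppose the top 10 ranked docs have these relevance judgments:
labels = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]

print(precision_recall_at_k(labels, 3))   # high precision, low recall
print(precision_recall_at_k(labels, 10))  # (0.4, 1.0): low precision, full recall
```

Sweeping k from 1 to the list length traces out the precision-recall curve and makes the tradeoff above concrete.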

Page 29: Introduction to Biomedical Informatics Information  Retrieval


Precision-Recall Curve (form of ROC)

C is universally worse than A and B

Instead of evaluating the entire curve, we can also look at, e.g., (a) precision at fixed recall (e.g., 10%), or (b) precision when precision = recall

Page 30: Introduction to Biomedical Informatics Information  Retrieval


TREC evaluations

• Text REtrieval Conference (TREC)
  – Web site: trec.nist.gov

• Annual impartial evaluation of IR systems
  – e.g., D = 1 million documents
  – TREC organizers supply contestants with several hundred queries Q

– Each competing system provides its ranked list of documents

– Union of top 100 ranked documents or so from each system is then manually judged to be relevant or non-relevant for each query Q

– Precision, recall, etc, then calculated and systems compared

Page 31: Introduction to Biomedical Informatics Information  Retrieval


Other Examples of Evaluation Data Sets

• Cranfield data
  – Number of documents = 1400
  – 225 queries, “medium length”, manually constructed “test questions”
  – Relevance determined by expert committee (from 1968)

• Newsgroups
  – Articles from 20 Usenet newsgroups
  – Queries = randomly selected documents
  – Relevance: is the document d in the same category as the query doc?

Page 32: Introduction to Biomedical Informatics Information  Retrieval


Performance on Cranfield Document Set

Page 33: Introduction to Biomedical Informatics Information  Retrieval


Performance on Newsgroups Data

Page 34: Introduction to Biomedical Informatics Information  Retrieval


Related Types of Data

• Sparse high-dimensional data sets with counts, like document-term matrices, are common in data mining, e.g.,
  – “transaction data”
    • Rows = customers
    • Columns = products
  – Web log data (ignoring sequence)
    • Rows = Web surfers
    • Columns = Web pages

• Recommender systems
  – Given some products from user i, suggest other products to the user
    • e.g., Amazon.com’s book recommender
  – Collaborative filtering:
    • use the k nearest individuals as the basis for predictions
  – Many similarities with querying and information retrieval
    • e.g., use of cosine distance to normalize vectors

Page 35: Introduction to Biomedical Informatics Information  Retrieval


Practical Issues: Computation Speed

• Say you are doing information retrieval at “Google scale”
  – e.g., 100 billion documents in the corpus, 1 million terms in your vocabulary
  – So given a new query Q, you have to compute 100 billion distance calculations, each involving 1 million terms
  – How can this be done in “near real-time” (e.g., 200 milliseconds)?

• Sparse data structures
  – e.g., 3 columns: <docid termid count>
  – Vastly reduces memory requirements and speeds up search

• Inverted index
  – List of sorted <termid docid> pairs
  – useful for quickly finding only the docs that contain query terms (stage 1)

• Massively parallel processing
  – Different sets of docs on different processors; results are then pooled and ranked
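The inverted-index idea above can be sketched on a toy corpus: for each term, keep a sorted posting list of the documents containing it, and intersect posting lists instead of scanning every document:

```python
# Sketch of an inverted index. Candidate docs for a query are found by
# intersecting posting lists (stage 1 of the two-stage approach).
from collections import defaultdict

docs = {
    1: ["database", "sql", "index"],
    2: ["regression", "likelihood"],
    3: ["database", "index", "regression"],
}

# Build the index: term -> sorted list of doc ids (the "posting list")
index = defaultdict(list)
for doc_id in sorted(docs):
    for term in set(docs[doc_id]):
        index[term].append(doc_id)

def candidates(query_terms):
    """Docs containing all query terms, without scanning the whole corpus."""
    postings = [set(index.get(t, [])) for t in query_terms]
    return sorted(set.intersection(*postings)) if postings else []

print(candidates(["database", "index"]))  # docs 1 and 3
```

Only the surviving candidates then need full similarity scoring, which is what makes interactive-speed retrieval feasible.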

Page 36: Introduction to Biomedical Informatics Information  Retrieval


Aspects of Web-based Search

• Additional information in Web documents
  – Link structure (e.g., PageRank algorithm)
  – HTML structure
    • Link/anchor text
    • Title text
    • etc.
  – Can be leveraged for better retrieval

• Additional issues in Web retrieval
  – Scalability: size of the “corpus” is huge (10’s of billions of docs)
  – Constantly changing:
    • Crawlers to update document-term information
    • need schemes for efficiently updating indices
  – Evaluation is more difficult – how is relevance measured? How many documents in total are relevant?

Page 37: Introduction to Biomedical Informatics Information  Retrieval


Example: Google Search Engine

• Offline:
  – Continuously crawl the Web to create an index of Web documents
  – Create a large-scale distributed inverted index

• Real-time: a user issues a query q
  – Parallel processing used to find documents that match q exactly (might be 1 million documents)
  – These documents are then scored based on 100 or more features
    • Scoring is typically a logistic regression model learned from past search data for this query q, where 1 = user clicked on a link and 0 = no click
  – Top 10 scoring links are displayed to the user
    • May be personalized (based on past searches) and localized
  – All of this has to happen in about ½ a second!

Page 38: Introduction to Biomedical Informatics Information  Retrieval


Example: PubMed System

• PubMed
  – Free biomedical literature search service maintained by NCBI (NIH)
  – Over 21 million papers indexed
  – Abstracts, citations, etc. for over 5000 life-science journals, back to 1948
  – The most widely used Web tool for searching the biomedical literature
    • Several million queries per day

http://www.ncbi.nlm.nih.gov/pubmed/

Page 39: Introduction to Biomedical Informatics Information  Retrieval


Page 40: Introduction to Biomedical Informatics Information  Retrieval


PubMed Querying

• Basic query = Boolean functions of keywords
  – E.g., Query = (stomach OR liver) AND cancer NOT smoking
  – Implicit ANDs are inserted between keywords
    • e.g., NOT smoking is really AND NOT smoking in the query above
  – Advanced search allows one to define queries on additional fields such as author, date, journal, MeSH term, language, etc.

• Query is extended to include MeSH terms
  – If any keyword can be mapped to MeSH, then PubMed also retrieves all documents indexed by this MeSH term

• Ranking is done in reverse chronological order

• Queries often return many docs, e.g., over 247,000 docs for the query above

Page 41: Introduction to Biomedical Informatics Information  Retrieval


Page 42: Introduction to Biomedical Informatics Information  Retrieval


Systems that go beyond PubMed
(From Lu, “PubMed and Beyond”, Database, 2011)

Page 43: Introduction to Biomedical Informatics Information  Retrieval


Systems that go beyond PubMed
(From Lu, “PubMed and Beyond”, Database, 2011)

Page 44: Introduction to Biomedical Informatics Information  Retrieval


Systems that go beyond PubMed
(From Lu, “PubMed and Beyond”, Database, 2011)

Page 45: Introduction to Biomedical Informatics Information  Retrieval


Using Text Mining to Interpret Queries

(From Krallinger, Valencia, Hirschman, Genome Biology, 2008)

Page 46: Introduction to Biomedical Informatics Information  Retrieval


Further Reading

See class Web page for various pointers

Information retrieval in the health/biomedical context:
Information Retrieval: A Health and Biomedical Perspective, W. Hersh, Springer, 2009

Very useful reference on indexing and searching text:
Managing Gigabytes: Compressing and Indexing Documents and Images, 2nd edition, Witten, Moffat, and Bell, Morgan Kaufmann, 1999

Web-related document search:
An excellent resource on Web-related search is Chapter 3, “Web Search and Information Retrieval”, in Mining the Web: Discovering Knowledge from Hypertext Data, S. Chakrabarti, Morgan Kaufmann, 2003

Practical aspects of how real Web search engines work:
http://searchenginewatch.com/

Latent Semantic Analysis:
Applied to grading of essays: “The debate on automated essay grading”, M. Hearst et al., IEEE Intelligent Systems, September/October 2000