Introduction to Biomedical Informatics: Information Retrieval


Page 1: Introduction to Biomedical Informatics Information  Retrieval

© Hayes/Smyth: Introduction to Biomedical Informatics: 1

Introduction to Biomedical Informatics

Information Retrieval

Page 2: Introduction to Biomedical Informatics Information  Retrieval


Outline

• Introduction: basic concepts

• Document-query matching methods
  – TF-IDF
  – Latent semantic indexing

• Evaluation methods

• Web-scale retrieval

• Example: PubMed

• Additional Resources and Recommended Reading

Page 3: Introduction to Biomedical Informatics Information  Retrieval


Information Retrieval

• Generic task:
  – User has a query Q expressed in some way (e.g., a set of keywords)
  – The user would like to find the documents from some corpus D that are most relevant to Q

• Information retrieval is the problem of automatically finding and ranking the most relevant documents in a corpus D, given a query Q

• Examples:
  – Q = {lung cancer, smoking}, D = 20 million papers in PubMed
  – Q = {pizza, irvine}, D = all documents on the Web

Page 4: Introduction to Biomedical Informatics Information  Retrieval


General Issues in Document Querying

– What representation language to use for docs and queries

– How to measure similarity between Q and each document in D

– How to rank the results for the user

– Allowing user feedback (query modification)

– How to evaluate and compare different IR algorithms/systems

– How to compute the results in real-time (for interactive querying)

Page 5: Introduction to Biomedical Informatics Information  Retrieval


General Concepts

• Corpus D consisting of N documents
  – Typically represented as an N x d matrix
  – Each document represented as a vector of d terms
    • E.g., entry i, j is the number of times term j occurs in document i

• Query Q:
  – User poses a query to search D
  – Query is typically expressed as a vector of d terms
    • Query Q is expressed as a set of words, e.g., “data” and “mining” are both set to 1 and all other terms are set to 0 (so we can think of the query Q as a “pseudo-document”)

• Key ideas:
  – Represent both documents and queries as vectors in some term-space
  – Matching a query with documents => defining a vector similarity measure

Page 6: Introduction to Biomedical Informatics Information  Retrieval


Querying Approaches

• Exact-match query: return a list of all exact matches
  – Boolean match on text
    • query = “Irvine” AND “fun”: return all docs containing both “Irvine” and “fun”
  – Can generalize to Boolean functions
    • e.g., NOT (“Irvine” OR “Newport Beach”) AND “fun”
  – Not so useful when there are many matches
    • E.g., “data mining” in Google returns millions of documents

• Ranked queries: return a ranked list of the most relevant matches
  – e.g., which document is most similar to a query Q?
  – Q could itself be a full document or a shorter version (e.g., one or a few words); we will focus here on short (few-word) queries

• Typical two-stage approach (e.g., in commercial search engines)
  – First use exact match to retrieve an initial set of documents
  – Then use more sophisticated similarity measures to rank the documents in this set based on how relevant they are to the query Q
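The two-stage approach above can be sketched in a few lines. Stage 1 is the Boolean exact match; the corpus, function names, and query here are illustrative toy examples, not any real system's API.

```python
# Minimal sketch of stage 1: exact Boolean (AND) matching over a toy corpus.
# All names and documents here are illustrative.

docs = {
    1: "irvine is fun and sunny",
    2: "newport beach is near irvine",
    3: "data mining is fun",
}

def matches(doc_text, required_terms):
    """Return True if the document contains every required term (Boolean AND)."""
    words = set(doc_text.split())
    return all(term in words for term in required_terms)

# query = "Irvine" AND "fun"
candidates = [doc_id for doc_id, text in docs.items() if matches(text, ["irvine", "fun"])]
print(candidates)  # only doc 1 contains both terms
```

Stage 2 would then rank only these candidates with a similarity measure such as cosine distance.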

Page 7: Introduction to Biomedical Informatics Information  Retrieval


“Bag of Words” Representation for Documents

• document: book, paper, WWW page, ...
• term: word, word-pair, phrase, … (could be millions)
• query Q = set of terms, e.g., “data” + “mining”
• Full NLP (natural language processing) is too hard, so …
  – we want a (vector) representation for text which
  – retains maximum useful semantics
  – supports efficient distance computation between docs and Q

• “bag of words” ignores word order, sentence structure, etc.
  – Nonetheless works well in practice and is widely used
  – Very computationally efficient compared to dealing with word order

• term values
  – Boolean (e.g., term in document or not): “bag of words”
  – real-valued (e.g., frequency of term in doc; relative to all docs) ...

Page 8: Introduction to Biomedical Informatics Information  Retrieval


Example of a document-term (bag-of-words) matrix

     database  SQL  index  regression  likelihood  linear
d1      24      21     9        0          0         3
d2      32      10     5        0          3         0
d3      12      16     5        0          0         0
d4       6       7     2        0          0         0
d5      43      31    20        0          3         0
d6       2       0     0       18          7        16
d7       0       0     1       32         12         0
d8       3       0     0       22          4         2
d9       1       0     0       34         27        25
d10      6       0     0       17          4        23

Page 9: Introduction to Biomedical Informatics Information  Retrieval


Practical Issues

• Tokenization
  – Convert document to a list of word counts
  – word token = “any nonempty sequence of characters”
  – challenges: punctuation, equations, HTML, formatting, etc.
    • Special parsers to handle these issues

• Canonical forms, Stopwords, Stemming
  – Typically remove capitalization
    • But capitalization may be important for proper nouns, e.g., the name “Gene” vs. “gene”
  – Stopwords:
    • remove very frequent words (a, the, and, …); can use a standard list
    • Can also remove very rare words
  – Stemming (next slide)
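A minimal tokenization pipeline covering the steps above (lowercasing, crude punctuation handling, stopword removal). The regex and the tiny stopword list are simplified assumptions for illustration:

```python
# Illustrative tokenization pipeline: lowercase, strip punctuation,
# drop stopwords, and count the remaining word tokens.
import re
from collections import Counter

STOPWORDS = {"a", "an", "the", "and", "of", "is", "in"}  # a standard list is much longer

def tokenize(text):
    text = text.lower()                      # canonical form (loses "Gene" vs. "gene")
    tokens = re.findall(r"[a-z0-9]+", text)  # crude handling of punctuation/formatting
    return [t for t in tokens if t not in STOPWORDS]

counts = Counter(tokenize("The gene, and the Gene: a study of genes."))
print(counts)  # Counter({'gene': 2, 'study': 1, 'genes': 1})
```

Note that without stemming, "gene" and "genes" remain distinct index terms.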

Page 10: Introduction to Biomedical Informatics Information  Retrieval


Stemming

• Want to reduce all morphological variants of a word to a single index term
  – e.g., a document containing words like fish and fisher may not be retrieved by a query containing fishing (fishing is not explicitly contained in the document)

• Stemming: reduce words to their root form
  – e.g., fish becomes a new index term

• Porter stemming algorithm (1980)
  – relies on a preconstructed suffix list with associated rules
    • e.g., if suffix = IZATION and the prefix contains at least one vowel followed by a consonant, replace with suffix = IZE
      – BINARIZATION => BINARIZE
  – Not always desirable: e.g., {university, universal} -> univers (in Porter’s)

• Alternatives include WordNet, which is a dictionary-based approach
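The IZATION -> IZE rule above can be sketched as follows. This is a single illustrative rule, not the full Porter algorithm (which has many rules and a more precise "measure" condition):

```python
# One Porter-style suffix rule (IZATION -> IZE), as described on the slide.
# A real stemmer applies many such rules in ordered passes.

VOWELS = set("aeiou")

def has_vowel_then_consonant(prefix):
    """True if the prefix contains a vowel followed (eventually) by a consonant."""
    seen_vowel = False
    for ch in prefix:
        if ch in VOWELS:
            seen_vowel = True
        elif seen_vowel:
            return True
    return False

def apply_ization_rule(word):
    if word.endswith("ization"):
        prefix = word[:-len("ization")]
        if has_vowel_then_consonant(prefix):
            return prefix + "ize"
    return word

print(apply_ization_rule("binarization"))  # -> binarize
```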

Page 11: Introduction to Biomedical Informatics Information  Retrieval


Google n-gram Data

EXAMPLE OF TRIGRAMS:

ceramics collectables collectibles    55
ceramics collectables fine           130
ceramics collected by                 52
ceramics collectible pottery          50
ceramics collectibles cooking         45
ceramics collection ,                144
ceramics collection .                247
ceramics collection </S>             120
ceramics collection and               43

Number of tokens      1 trillion
Number of sentences   95 billion
Number of unigrams    13 million
Number of bigrams     314 million
Number of trigrams    977 million
Number of fourgrams   1.3 billion
Number of fivegrams   1.2 billion

Data made available for research by Google in 2006

Page 12: Introduction to Biomedical Informatics Information  Retrieval


Document Similarity

• Measuring similarity between two document term vectors x and y:
  – wide variety of distance metrics:
    • Euclidean (L2) = sqrt(Σ_i (x_i - y_i)²)
    • L1 = Σ_i |x_i - y_i|
    • ...
    • weighted L2 = sqrt(Σ_i (w_i x_i - w_i y_i)²)

• Cosine distance between docs:

    cos(x, y) = xᵀy / (‖x‖ ‖y‖)

  – often gives better results than Euclidean
    • normalizes relative to document length
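A sketch of the cosine computation on rows of the toy doc-term matrix; in practice one would use numpy or scipy, but the formula is just the dot product over the product of norms:

```python
# Cosine similarity between two term vectors: cos(x, y) = x.y / (|x| |y|).
import math

def cosine(x, y):
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm_x = math.sqrt(sum(xi * xi for xi in x))
    norm_y = math.sqrt(sum(yi * yi for yi in y))
    return dot / (norm_x * norm_y)

d1 = [24, 21, 9, 0, 0, 3]   # row d1 of the toy doc-term matrix ("database" topic)
d6 = [2, 0, 0, 18, 7, 16]   # row d6 ("regression" topic)

print(round(cosine(d1, d1), 2))  # 1.0: identical direction, regardless of length
print(round(cosine(d1, d6), 2))  # small value: the two docs are about different topics
```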

Page 13: Introduction to Biomedical Informatics Information  Retrieval


Distance matrices for toy document-term data

TF doc-term matrix

     t1  t2  t3  t4  t5  t6
d1   24  21   9   0   0   3
d2   32  10   5   0   3   0
d3   12  16   5   0   0   0
d4    6   7   2   0   0   0
d5   43  31  20   0   3   0
d6    2   0   0  18   7  16
d7    0   0   1  32  12   0
d8    3   0   0  22   4   2
d9    1   0   0  34  27  25
d10   6   0   0  17   4  23

[Euclidean and cosine distance matrices shown as figures in the original slide]

Page 14: Introduction to Biomedical Informatics Information  Retrieval


TF-IDF Term Weighting Schemes

• Not all terms in a query or document may be equally important...

• TF (term frequency): term weight = number of times the term occurs in that document
  – problem: a term common to many docs => low discrimination, e.g., “medical”

• IDF (inverse document frequency of a term)
  – n_j documents contain term j, N documents in total
  – IDF = log(N/n_j)
  – Favors terms that occur in relatively few documents

• TF-IDF: TF(term)*IDF(term)

• No real theoretical basis, but works very well empirically and widely used

Page 15: Introduction to Biomedical Informatics Information  Retrieval


TF-IDF Example

TF doc-term matrix

     t1  t2  t3  t4  t5  t6
d1   24  21   9   0   0   3
d2   32  10   5   0   3   0
d3   12  16   5   0   0   0
d4    6   7   2   0   0   0
d5   43  31  20   0   3   0
d6    2   0   0  18   7  16
d7    0   0   1  32  12   0
d8    3   0   0  22   4   2
d9    1   0   0  34  27  25
d10   6   0   0  17   4  23

IDF weights are (0.1, 0.7, 0.5, 0.7, 0.4, 0.7) (using natural logs, i.e., log base e)

Page 16: Introduction to Biomedical Informatics Information  Retrieval


TF-IDF Example

TF doc-term matrix

     t1  t2  t3  t4  t5  t6
d1   24  21   9   0   0   3
d2   32  10   5   0   3   0
d3   12  16   5   0   0   0
d4    6   7   2   0   0   0
d5   43  31  20   0   3   0
d6    2   0   0  18   7  16
d7    0   0   1  32  12   0
d8    3   0   0  22   4   2
d9    1   0   0  34  27  25
d10   6   0   0  17   4  23

TF-IDF doc-term matrix

      t1    t2    t3   t4   t5   t6
d1   2.5  14.6   4.6    0    0  2.1
d2   3.4   6.9   2.6    0  1.1    0
d3   1.3  11.1   2.6    0    0    0
d4   0.6   4.9   1.0    0    0    0
d5   4.5  21.5  10.2    0  1.1    0
...

Example: TF-IDF for term t1 in doc d1 = TF*IDF = 24 * log(10/9) ≈ 2.5

IDF weights are (0.1, 0.7, 0.5, 0.7, 0.4, 0.7) (using natural logs, i.e., log base e)
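The slide's numbers can be reproduced directly. A sketch using the toy TF matrix and IDF_j = ln(N / n_j), where n_j is the number of documents containing term j:

```python
# Reproducing the slide's TF-IDF computation on the toy doc-term matrix:
# IDF_j = ln(N / n_j), TF-IDF_ij = TF_ij * IDF_j.
import math

tf = [
    [24, 21,  9,  0,  0,  3],
    [32, 10,  5,  0,  3,  0],
    [12, 16,  5,  0,  0,  0],
    [ 6,  7,  2,  0,  0,  0],
    [43, 31, 20,  0,  3,  0],
    [ 2,  0,  0, 18,  7, 16],
    [ 0,  0,  1, 32, 12,  0],
    [ 3,  0,  0, 22,  4,  2],
    [ 1,  0,  0, 34, 27, 25],
    [ 6,  0,  0, 17,  4, 23],
]
N = len(tf)
d = len(tf[0])

# n_j = number of documents containing term j
n = [sum(1 for row in tf if row[j] > 0) for j in range(d)]
idf = [math.log(N / nj) for nj in n]
tfidf = [[row[j] * idf[j] for j in range(d)] for row in tf]

print([round(w, 1) for w in idf])  # [0.1, 0.7, 0.5, 0.7, 0.4, 0.7], as on the slide
print(round(tfidf[0][0], 1))       # d1, t1: 24 * ln(10/9) ≈ 2.5
```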

Page 17: Introduction to Biomedical Informatics Information  Retrieval


Simple Document Querying System

• Queries Q = binary term vectors

• Documents represented by TF-IDF weights

• Cosine distance used for retrieval and ranking

Page 18: Introduction to Biomedical Informatics Information  Retrieval


Example of a document-term (bag-of-words) matrix

     database  SQL  index  regression  likelihood  linear
d1      24      21     9        0          0         3
d2      32      10     5        0          3         0
d3      12      16     5        0          0         0
d4       6       7     2        0          0         0
d5      43      31    20        0          3         0
d6       2       0     0       18          7        16
d7       0       0     1       32         12         0
d8       3       0     0       22          4         2
d9       1       0     0       34         27        25
d10      6       0     0       17          4        23

Page 19: Introduction to Biomedical Informatics Information  Retrieval


Representing a Query as a Pseudo-Document

     database  SQL  index  regression  likelihood  linear
d1      24      21     9        0          0         3
d2      32      10     5        0          3         0
d3      12      16     5        0          0         0
d4       6       7     2        0          0         0
d5      43      31    20        0          3         0
d6       2       0     0       18          7        16
d7       0       0     1       32         12         0
d8       3       0     0       22          4         2
d9       1       0     0       34         27        25
d10      6       0     0       17          4        23

Q        1       0     1        0          0         0

Example query Q = [database index]

Page 20: Introduction to Biomedical Informatics Information  Retrieval


Example: Query Q = [t1 t3]

      TF   TF-IDF
d1   0.70   0.32
d2   0.77   0.51
d3   0.58   0.24
d4   0.60   0.23
d5   0.79   0.43
...

TF doc-term matrix

     t1  t2  t3  t4  t5  t6
d1   24  21   9   0   0   3
d2   32  10   5   0   3   0
d3   12  16   5   0   0   0
d4    6   7   2   0   0   0
d5   43  31  20   0   3   0
d6    2   0   0  18   7  16
d7    0   0   1  32  12   0
d8    3   0   0  22   4   2
d9    1   0   0  34  27  25
d10   6   0   0  17   4  23

Q = (1,0,1,0,0,0)

TF-IDF doc-term matrix

      t1    t2    t3   t4   t5   t6
d1   2.5  14.6   4.6    0    0  2.1
d2   3.4   6.9   2.6    0  1.1    0
d3   1.3  11.1   2.6    0    0    0
d4   0.6   4.9   1.0    0    0    0
d5   4.5  21.5  10.2    0  1.1    0
...
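A sketch reproducing the TF column of the table above: score the query vector Q = (1,0,1,0,0,0) against each TF row with cosine similarity:

```python
# Scoring the query Q = (1,0,1,0,0,0) (terms t1 and t3) against the first
# five rows of the toy TF matrix with cosine similarity.
import math

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

tf = [
    [24, 21,  9,  0,  0,  3],
    [32, 10,  5,  0,  3,  0],
    [12, 16,  5,  0,  0,  0],
    [ 6,  7,  2,  0,  0,  0],
    [43, 31, 20,  0,  3,  0],
]
q = [1, 0, 1, 0, 0, 0]

scores = [round(cosine(q, row), 2) for row in tf]
print(scores)  # [0.7, 0.77, 0.58, 0.6, 0.79], matching the TF column of the table
```

Documents would then be returned in decreasing order of score (here d5 first, then d2).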

Page 21: Introduction to Biomedical Informatics Information  Retrieval


Synonymy and Polysemy

• Synonymy
  – the same concept can be expressed using different sets of terms
    • e.g., bandit, brigand, thief
  – negatively affects recall (i.e., the number of relevant docs returned)

• Polysemy
  – identical terms can be used in very different semantic contexts
    • bank
    • bear left at the zoo
    • time flies like an arrow
  – negatively affects precision (i.e., the relevance of the returned docs)

Page 22: Introduction to Biomedical Informatics Information  Retrieval


Latent Semantic Indexing

• Approximate data in the original d-dimensional space by data in a k-dimensional space, where k << d

• Find the k linear projections of the data that contain the most variance
  – Basic approach is known as principal component analysis or singular value decomposition
  – Also known as “latent semantic indexing” when applied to text

• Captures dependencies among terms
  – In effect represents the original d-dimensional basis with a k-dimensional basis
  – e.g., terms like SQL, indexing, query could be approximated as coming from a single “hidden” term

• Why is this useful?
  – Query contains “automobile”, document contains “vehicle”
  – can still match Q to the document since the two terms will be close in k-space (but not in the original space), i.e., addresses the synonymy problem

Page 23: Introduction to Biomedical Informatics Information  Retrieval


Toy example of a document-term matrix

     database  SQL  index  regression  likelihood  linear
d1      24      21     9        0          0         3
d2      32      10     5        0          3         0
d3      12      16     5        0          0         0
d4       6       7     2        0          0         0
d5      43      31    20        0          3         0
d6       2       0     0       18          7        16
d7       0       0     1       32         12         0
d8       3       0     0       22          4         2
d9       1       0     0       34         27        25
d10      6       0     0       17          4        23

Page 24: Introduction to Biomedical Informatics Information  Retrieval


Singular Value Decomposition (SVD)

• M = U S VT

  – M = n x d = original document-term matrix (the data)

  – U = n x d, each row = vector of weights for each document

  – S = d x d diagonal matrix of singular values

  – Rows of VT (columns of V) = new orthogonal basis for the data

  – Each singular value indicates how much of the variance in the data is captured by the corresponding basis vector

  – Typically select just the first k basis vectors, k << d

(also known as principal components, or LSI (latent semantic indexing))

optional
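A sketch of the truncated SVD on the toy matrix using numpy. Note the sign of each singular vector is arbitrary, so the document coordinates may differ from the slide's by a sign flip:

```python
# LSI via truncated SVD: keep only the first k singular vectors of the
# document-term matrix and work in the resulting k-dimensional space.
import numpy as np

M = np.array([
    [24, 21,  9,  0,  0,  3],
    [32, 10,  5,  0,  3,  0],
    [12, 16,  5,  0,  0,  0],
    [ 6,  7,  2,  0,  0,  0],
    [43, 31, 20,  0,  3,  0],
    [ 2,  0,  0, 18,  7, 16],
    [ 0,  0,  1, 32, 12,  0],
    [ 3,  0,  0, 22,  4,  2],
    [ 1,  0,  0, 34, 27, 25],
    [ 6,  0,  0, 17,  4, 23],
], dtype=float)

U, s, Vt = np.linalg.svd(M, full_matrices=False)

k = 2                                        # keep the first k basis vectors
M_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]  # rank-k approximation of M

# Coordinates of each document in the 2-dimensional "concept" space;
# the first axis separates the database docs from the regression docs.
doc_coords = U[:, :k] * s[:k]
print(doc_coords.round(1))
```

A query vector can be projected into the same k-space (q Vᵀ, suitably scaled) and compared to documents there, which is how LSI addresses synonymy.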

Page 25: Introduction to Biomedical Informatics Information  Retrieval


Example: SVD Applied to Document-Term Matrix

U1 U2

d1 30.9 -11.5

d2 30.3 -10.8

d3 18.0 -7.7

d4 8.4 -3.6

d5 52.7 -20.6

d6 14.2 21.8

d7 10.8

d8 11.5 28.0

d9 9.5 17.8

d10 19.9 45.0

     database  SQL  index  regression  likelihood  linear

d1 24 21 9 0 0 3

d2 32 10 5 0 3 0

d3 12 16 5 0 0 0

d4 6 7 2 0 0 0

d5 43 31 20 0 3 0

d6 2 0 0 18 7 16

d7 0 0 1 32 12 0

d8 3 0 0 22 4 2

d9 1 0 0 34 27 25

d10 6 0 0 17 4 23

Page 26: Introduction to Biomedical Informatics Information  Retrieval


v1 = [0.74, 0.49, 0.27, 0.28, 0.18, 0.19]
v2 = [-0.28, -0.24, -0.12, 0.74, 0.37, 0.31]

New documents:
D1 = database x 50
D2 = SQL x 50

optional

Page 27: Introduction to Biomedical Informatics Information  Retrieval


Evaluating Retrieval Methods

• Typically there is no real ground truth for a query – so how can we evaluate our algorithms? Is A better than B?

• Academic research
  – Use small testbed data sets of documents where human labelers assign a binary label to each document in the corpus, in terms of its relevance to a specific query Q
  – repeat for different queries
  – very time-consuming!

• Real-world (e.g., Web search)
  – Can use click data as a surrogate indicator for relevancy
  – Can generate very large amounts of training/test data per query

• Both approaches are useful for precision, not so useful for recall

Page 28: Introduction to Biomedical Informatics Information  Retrieval


Precision versus Recall

Rank documents (numerically) with respect to the query

Compute precision and recall by thresholding the rankings

precision = fraction of retrieved objects that are relevant

recall = fraction of the relevant objects that are retrieved (retrieved relevant / total relevant)

Tradeoff: high precision -> low recall, and vice-versa

Similar in concept to the receiver operating characteristic (ROC)

For multiple queries, precision for specific ranges of recall can be averaged (so-called “interpolated precision”).
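A sketch of precision and recall at a rank cutoff k, assuming all relevant documents appear somewhere in the ranked list (toy relevance labels, illustrative only):

```python
# Precision and recall at rank k from a ranked list of 0/1 relevance labels.

def precision_recall_at_k(ranked_relevance, k):
    """ranked_relevance: list of 0/1 labels in ranked order (1 = relevant)."""
    retrieved = ranked_relevance[:k]
    total_relevant = sum(ranked_relevance)  # assumes all relevant docs are in the list
    precision = sum(retrieved) / k
    recall = sum(retrieved) / total_relevant
    return precision, recall

# Suppose the top 10 ranked docs have these relevance judgments:
labels = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]

print(precision_recall_at_k(labels, 3))   # high precision, low recall
print(precision_recall_at_k(labels, 10))  # (0.4, 1.0): low precision, full recall
```

Sweeping k from 1 to the list length traces out the precision-recall curve and makes the tradeoff above concrete.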

Page 29: Introduction to Biomedical Informatics Information  Retrieval


Precision-Recall Curve (form of ROC)

C is universally worse than A and B

Instead of evaluating the entire curve, we can also look at, e.g., (a) precision at fixed recall (e.g., 10%), or (b) precision when precision = recall

Page 30: Introduction to Biomedical Informatics Information  Retrieval


TREC evaluations

• Text REtrieval Conference (TREC)
  – Web site: trec.nist.gov

• Annual impartial evaluation of IR systems
  – e.g., D = 1 million documents
  – TREC organizers supply contestants with several hundred queries Q

– Each competing system provides its ranked list of documents

– Union of top 100 ranked documents or so from each system is then manually judged to be relevant or non-relevant for each query Q

– Precision, recall, etc, then calculated and systems compared

Page 31: Introduction to Biomedical Informatics Information  Retrieval


Other Examples of Evaluation Data Sets

• Cranfield data
  – Number of documents = 1400
  – 225 queries, “medium length”, manually constructed “test questions”
  – Relevance determined by expert committee (from 1968)

• Newsgroups
  – Articles from 20 Usenet newsgroups
  – Queries = randomly selected documents
  – Relevance: is the document d in the same category as the query doc?

Page 32: Introduction to Biomedical Informatics Information  Retrieval


Performance on Cranfield Document Set

Page 33: Introduction to Biomedical Informatics Information  Retrieval


Performance on Newsgroups Data

Page 34: Introduction to Biomedical Informatics Information  Retrieval


Related Types of Data

• Sparse high-dimensional data sets with counts, like document-term matrices, are common in data mining, e.g.,
  – “transaction data”
    • Rows = customers
    • Columns = products
  – Web log data (ignoring sequence)
    • Rows = Web surfers
    • Columns = Web pages

• Recommender systems
  – Given some products from user i, suggest other products to the user
    • e.g., Amazon.com’s book recommender
  – Collaborative filtering:
    • use the k nearest individuals as the basis for predictions
  – Many similarities with querying and information retrieval
    • e.g., use of cosine distance to normalize vectors

Page 35: Introduction to Biomedical Informatics Information  Retrieval


Practical Issues: Computation Speed

• Say you are doing information retrieval at “Google scale”
  – e.g., 100 billion documents in the corpus, 1 million terms in your vocabulary
  – So given a new query Q, you have to compute 100 billion distance calculations, each involving 1 million terms
  – How can this be done in “near real-time” (e.g., 200 milliseconds)?

• Sparse data structures
  – e.g., 3 columns: <docid termid count>
  – Vastly reduces memory requirements and speeds up search

• Inverted index
  – List of sorted <termid docid> pairs
  – useful for quickly finding only the docs that contain query terms (stage 1)

• Massively parallel processing
  – Different sets of docs on different processors; results are then pooled and ranked
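The inverted-index idea above can be sketched on a toy corpus: for each term, keep a sorted posting list of the documents containing it, and intersect posting lists instead of scanning every document:

```python
# Sketch of an inverted index. Candidate docs for a query are found by
# intersecting posting lists (stage 1 of the two-stage approach).
from collections import defaultdict

docs = {
    1: ["database", "sql", "index"],
    2: ["regression", "likelihood"],
    3: ["database", "index", "regression"],
}

# Build the index: term -> sorted list of doc ids (the "posting list")
index = defaultdict(list)
for doc_id in sorted(docs):
    for term in set(docs[doc_id]):
        index[term].append(doc_id)

def candidates(query_terms):
    """Docs containing all query terms, without scanning the whole corpus."""
    postings = [set(index.get(t, [])) for t in query_terms]
    return sorted(set.intersection(*postings)) if postings else []

print(candidates(["database", "index"]))  # docs 1 and 3
```

Only the surviving candidates then need full similarity scoring, which is what makes interactive-speed retrieval feasible.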

Page 36: Introduction to Biomedical Informatics Information  Retrieval


Aspects of Web-based Search

• Additional information in Web documents
  – Link structure (e.g., PageRank algorithm)
  – HTML structure
    • Link/anchor text
    • Title text
    • etc.
  – Can be leveraged for better retrieval

• Additional issues in Web retrieval
  – Scalability: size of the “corpus” is huge (10’s of billions of docs)
  – Constantly changing:
    • Crawlers to update document-term information
    • need schemes for efficiently updating indices
  – Evaluation is more difficult – how is relevance measured? How many documents in total are relevant?

Page 37: Introduction to Biomedical Informatics Information  Retrieval


Example: Google Search Engine

• Offline:
  – Continuously crawl the Web to create an index of Web documents
  – Create a large-scale distributed inverted index

• Real-time: a user issues a query q
  – Parallel processing used to find documents that match q exactly (might be 1 million documents)
  – These documents are then scored based on 100 or more features
    • Scoring is typically a logistic regression model learned from past search data for this query q, where 1 = user clicked on a link and 0 = no click
  – Top 10 scoring links are displayed to the user
    • May be personalized (based on past searches) and localized
  – All of this has to happen in about ½ a second!

Page 38: Introduction to Biomedical Informatics Information  Retrieval


Example: PubMed System

• PubMed
  – Free biomedical literature search service maintained by NCBI (NIH)
  – Over 21 million papers indexed
  – Abstracts, citations, etc. for over 5000 life-science journals, back to 1948
  – The most widely used Web tool for searching the biomedical literature
    • Several million queries per day

http://www.ncbi.nlm.nih.gov/pubmed/

Page 39: Introduction to Biomedical Informatics Information  Retrieval


Page 40: Introduction to Biomedical Informatics Information  Retrieval


PubMed Querying

• Basic query = Boolean functions of keywords
  – E.g., Query = (stomach OR liver) AND cancer NOT smoking
  – Implicit ANDs are inserted between keywords
    • e.g., NOT smoking is really AND NOT smoking in the query above
  – Advanced search allows one to define queries on additional fields such as author, date, journal, MeSH term, language, etc.

• Query is extended to include MeSH terms
  – If any keyword can be mapped to MeSH, then PubMed also retrieves all documents indexed by this MeSH term

• Ranking is done in reverse chronological order

• Queries often return many docs, e.g., over 247,000 docs for the query above

Page 41: Introduction to Biomedical Informatics Information  Retrieval


Page 42: Introduction to Biomedical Informatics Information  Retrieval


Systems that go beyond PubMed
(From Lu, “PubMed and Beyond”, Database, 2011)

Page 43: Introduction to Biomedical Informatics Information  Retrieval


Systems that go beyond PubMed
(From Lu, “PubMed and Beyond”, Database, 2011)

Page 44: Introduction to Biomedical Informatics Information  Retrieval


Systems that go beyond PubMed
(From Lu, “PubMed and Beyond”, Database, 2011)

Page 45: Introduction to Biomedical Informatics Information  Retrieval


Using Text Mining to Interpret Queries

(From Krallinger, Valencia, Hirschman, Genome Biology, 2008)

Page 46: Introduction to Biomedical Informatics Information  Retrieval


Further Reading

See class Web page for various pointers

Information retrieval in the health/biomedical context:
Information Retrieval: A Health and Biomedical Perspective, W. Hersh, Springer, 2009

Very useful reference on indexing and searching text:
Managing Gigabytes: Compressing and Indexing Documents and Images, 2nd edition, Witten, Moffat, and Bell, Morgan Kaufmann, 1999

Web-related document search:
An excellent resource on Web-related search is Chapter 3, “Web Search and Information Retrieval”, in Mining the Web: Discovering Knowledge from Hypertext Data, S. Chakrabarti, Morgan Kaufmann, 2003

Practical aspects of how real Web search engines work:
http://searchenginewatch.com/

Latent Semantic Analysis:
Applied to grading of essays: “The debate on automated essay grading”, M. Hearst et al., IEEE Intelligent Systems, September/October 2000