Sample Searches
• Query: smalheiser
• Response: DBLP Neil Smalheiser; PNAS Abstract Smalheiser et al. 97; MBC abstract Smalheiser …; UIC Dept of Psychiatry Neil Smalheiser
• Query: computer science genetics
• Campus program: university, campus, college and employment resources
• SpringerLink (On-line journals and books in science, technology and medicine)
• Course (Advanced topics in computer science and computational genetics)
• Annual Review of Computer Science
• …
Comments for Slides 1-2
The first two searches are intended to show that Google is reasonably good at retrieving Web pages when given a few keywords. In the first query, the name Smalheiser is submitted. Ideally, the home page of Smalheiser should be retrieved first. Instead, his publications in computer science are retrieved in the first document. This is followed by some of his publications in the medical area. Finally, his home page in Psychiatry is retrieved.
The second query asks for important documents in the intersection of the two areas “computer science” and “genetics”. The first retrieved document seems to be unrelated to the query. The second retrieved document seems OK. The third document is a course in both areas.
The examples show that Google is still far from perfect.
Information Retrieval
Document Representation:
Remove stop words:
Eg. in “Automatically Identifying Gene Terms in MEDLINE Abstracts”
Remove “in”
Stemming: “Automatically” becomes “automatic”; “Identifying” becomes “identify”; “Abstracts” becomes “abstract”
Comments on last slide
The first two steps in constructing a document representation consist of eliminating non-content words and mapping variations of the same word to the same stem via a process called stemming.
Document representation
• Document is a set of content words or terms:
{ automatic, identify, gene, term, medline, abstract}
Sometimes, keep locations of terms. Eg. “automatic” is the first word in the title
Comments on last slide
Location information can be of importance in differentiating the ordering of contents words in a query from other orderings of the same words. It is also useful in determining phrases.
Assign weights to terms
Term frequency: no. of times the term occurs in the document
Document frequency: no. of documents having the term
The weight of a term in a document: proportional to term frequency, inversely proportional to document frequency
Eg. weight = term frequency * log(N / document frequency), where N = no. of documents in collection
Comments on the last slide
The well-known tf-idf weighting scheme to assign a weight to a term is given. The weight is proportional to the term frequency and inversely proportional to its document frequency. There are numerous variations of this formula, but all of them have the property that higher weights are given to terms with higher term frequencies and lower document frequencies.
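As a rough illustration of this weighting scheme, the following sketch computes tf * log(N/df) weights over a tiny made-up collection (the documents and terms are placeholders, not from the slides):

```python
# A rough sketch of tf * log(N/df) weighting over a tiny made-up collection.
from collections import Counter
from math import log

docs = [
    ["automatic", "identify", "gene", "term", "medline", "abstract"],
    ["gene", "expression", "abstract"],
    ["information", "retrieval", "term", "weight"],
]
N = len(docs)                                       # no. of documents in collection
df = Counter(t for d in docs for t in set(d))       # document frequency of each term

def weights(doc):
    tf = Counter(doc)                               # term frequency in this document
    return {t: tf[t] * log(N / df[t]) for t in tf}  # tf * log(N / df)

print(weights(docs[0]))
```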
Other factors in assigning weights
Terms in title;
Terms in abstract; Terms in big fonts, etc.
get heavier weights
Comments to last slide
If a term occurs in the title, it usually gets a higher weight than the same term occurring in the main text. This may also apply to the term appearing in the abstract. If the term occurs in big fonts or in a way that attracts the reader's attention, it should also get a higher weight.
All these situations can be implemented by assuming that each occurrence of such a term is equivalent to k occurrences of the same term in the main text with k >1.
Query representation
Two common models:
Vector space model: query as a set of terms, possibly ordered
Boolean Model: Terms connected by “AND”, “OR” and “NOT”
Comments to the last slide
In the information retrieval literature, it has been shown that the vector space model is usually better than the Boolean model, because if a query contains quite a few terms connected by “AND”s, there may not be any document satisfying the query. If the terms are connected by “OR”s, there may be too many unordered documents satisfying the query, and the user has no efficient way to separate the useful documents from the irrelevant ones.
In practice, it is likely that a hybrid model having features of both models is used for effective retrieval.
Vector space model
Each dimension of a vector represents a distinct term; # dimensions = number of all terms in the collection, including proper names
Eg. Automatic identify gene … ( 1, 1, 1, ….)
Compute the similarity between a query and a document
Q = (q1, …, qn), D = (d1, …, dn)
Dot[Q, D] = Σ_i q_i * d_i
(# of terms in common; favors long documents)
Norm(D) = sqrt( Σ_i d_i^2 )
Cosine(Q, D) = Dot[Q, D] / ( Norm(D) * Norm(Q) )
Comments on last slide
When the documents are binary vectors, the Dot product similarity function obtains the number of terms in common between the two vectors. When the terms are weighted, the weights are incorporated into the similarity function. Clearly, this favors a long document such as an encyclopedia.
To compensate for this, the norm (length) of a document is included in the denominator of the similarity function, so that a longer document gets a larger denominator. The query norm is used to ensure that the Cosine function returns a value between 0 and 1, provided all terms have non-negative weights. When the two vectors differ from each other only by a positive multiplicative constant, their angular distance is 0 and the Cosine value is 1.
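A minimal sketch of the Dot, Norm and Cosine functions defined above, with the query and document represented as term-weight dictionaries (the example weights are made up):

```python
# A sketch of Dot, Norm and Cosine with term-weight dictionaries (weights made up).
from math import sqrt

def dot(q, d):
    return sum(q[t] * d.get(t, 0.0) for t in q)     # Dot[Q, D] = sum_i q_i * d_i

def norm(v):
    return sqrt(sum(w * w for w in v.values()))     # Norm(V) = sqrt(sum_i v_i^2)

def cosine(q, d):
    return dot(q, d) / (norm(q) * norm(d))          # Cosine(Q, D)

query = {"gene": 1.0, "abstract": 1.0}
doc = {"automatic": 0.3, "identify": 0.5, "gene": 0.8, "abstract": 0.4}
print(dot(query, doc), cosine(query, doc))
```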
Boolean model
gene AND abstract; (sometimes “+” is used to ensure that the term must be present)
gene OR abstract;
gene AND NOT abstract; (“-” indicates undesired terms; Eg. +gene –abstract)
Other features
Phrase search: “information retrieval”
Proximity search: information NEAR retrieval
Date search: 2003
Field search: Eg in the field “Author”, look for “Neal”
Wildcard search: smal*er
Comments on last slide
Some systems require a query phrase such as “information retrieval” to be placed in quotes. This may require a retrieved document to contain exactly such a phrase. If a document containing the words “retrieval of information” is also desired, the query can be reformulated as “information” NEAR “retrieval”.
Filtering operations can be specified by filling in additional information in specific fields such as the author field. Wildcard entries such as smal*er, where “*” denotes zero or more characters, are allowed, provided that “*” does not occur in the first few characters (say 3); otherwise, the space of matching strings to search will be too large.
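As an illustration of wildcard handling, the sketch below translates a pattern such as smal*er into a regular expression and matches it against a small made-up word list (a real engine would restrict this to the index, as noted above):

```python
# A sketch of wildcard matching: '*' stands for zero or more characters.
import re

def wildcard_matches(pattern, words):
    # Escape the pattern, then turn the escaped '*' back into '.*'.
    regex = re.compile("^" + re.escape(pattern).replace(r"\*", ".*") + "$")
    return [w for w in words if regex.match(w)]

index_words = ["smalheiser", "smaller", "smoother", "smelter"]   # made-up index words
print(wildcard_matches("smal*er", index_words))                  # ['smalheiser', 'smaller']
```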
Additional features
Case sensitive: java gets java, Java, JAVA; Java gets Java and possibly JAVA (first capital letter implies a proper name)
ordered query terms eg. stray dog
spelling error: if there is no such word, some search engines suggest similar words
Comments on last slide
Location information in documents and the query can be used to differentiate stray dog from dog stray.
If a word does not exist in the index of all words in the documents, then some search engines may suggest some neighboring words which differ from the misspelled word by 1 or 2 characters. Note that proper names are included in the index.
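A possible way to generate such suggestions is sketched below: a standard dynamic-programming edit distance keeps index words that differ from the query word by at most 2 characters (the word list is a made-up stand-in for the index):

```python
# A sketch of spelling suggestion: keep index words within edit distance 2.
def edit_distance(a, b):
    """Classic dynamic-programming (Levenshtein) edit distance."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[-1]

index_words = ["retrieval", "retrieve", "reversal", "smalheiser"]     # made-up index words
query_word = "retreival"                                              # misspelled query term
print([w for w in index_words if edit_distance(query_word, w) <= 2])  # ['retrieval']
```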
Directory search
Specify a subtree, e.g. in a directory whose top-level categories are computer, finance, medicine, …, where computer has children hardware, software, …
Query “memory” under computer means computer memory vs human memory in medicine
Comments on last slide
Directory search may reduce ambiguities. In the given directory, documents or pages are classified under each node. For example, there is a set of documents classified under computer and another class under medicine. The former class contains documents about computer memory, while the latter contains documents about human memory. If the query is restricted to the class “computer”, then only documents in the former class, relating to computer memory, will be retrieved.
Feedback
identify relevant documents and possibly irrelevant documents
re-formulate query using terms from relevant documents and from irrelevant documents;
Query: apple; Rel Doc: computer; Irrel: fruit
Modified query: apple, computer, - fruit
Comments to last slide
The user needs to identify relevant documents and possibly irrelevant documents. Terms from the relevant documents may be added to the query, while terms from the irrelevant documents may be used to exclude documents containing such terms from the next round of retrieval. In the example, the term “computer” is found in the relevant documents and is added to the query, while the term “fruit” is found in the irrelevant documents and is used to exclude documents containing that word.
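A minimal, set-based sketch of this feedback step is given below; classic systems use weighted (e.g. Rocchio-style) updates instead of plain sets:

```python
# A set-based sketch of query reformulation from relevance feedback.
def reformulate(query_terms, relevant_terms, irrelevant_terms):
    expanded = set(query_terms) | set(relevant_terms)    # add terms from relevant docs
    excluded = set(irrelevant_terms) - expanded          # exclude terms from irrelevant docs
    return expanded, excluded

query, excluded = reformulate({"apple"}, {"computer"}, {"fruit"})
print(query)     # {'apple', 'computer'}
print(excluded)  # {'fruit'}  -> documents containing these are filtered out next round
```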
Web
Surface Web: linked together
Deep Web: Not linked; documents can be generated dynamically by programs
Quite a few medical databases and bio-medical databases are in the Deep Web
Comments on the last slide
The Web is roughly classified into the Surface Web and the Deep Web. The pages in the former are hyperlinked, while pages in the latter are accessible only by submitting queries to query interfaces.
Web crawlers which extract content information from Surface Web pages are unable to get into Deep Web pages for lack of hyperlinks.
Retrieval from the surface Web
Anchor text: belongs to the document pointed to.
<a href="http://tigger.uic.edu/htbin/cgiwrap/bin/newsbureau/cgi-bin/index.cgi">More News</a>
Page rank: importance of a Web page
Rank(P) = Σ_i Rank(Q_i) / out(Q_i)
for every Q_i pointing to P, where out(Q_i) is the number of out-links of Q_i; computed iteratively; Web surfing interpretation
Comments on last slide
There are some differences between retrieval from the Web and from non-Web sources. In the former case, words known as anchor texts which appear together with the link from a page A to another page B should be utilized for retrieval. Specifically, the anchor words should be used as content words for page B, as they describe the contents of B as observed by the user who creates A.
Example to illustrate page rank
Rank(P) = 1/2 Rank(A) + 1/3 Rank(B)
(Figure: pages A and B both point to P; A has 2 out-links and B has 3.)
A lot of pages point to the IBM home page, implying that it has a very high page rank.
Comments on last slide
The example illustrates how the page rank of a page can be computed. In practice, all pages are initialized with the same rank and the page rank formula is applied to compute the page ranks of all pages. This process is repeated until convergence is reached. Under some reasonable assumptions, convergence is guaranteed. The page rank information is utilized to rank pages for any user query.
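A minimal sketch of this iteration, applying the page-rank formula from the earlier slide to a small made-up link graph (no damping factor, as on the slide):

```python
# A sketch of the page-rank iteration on a small made-up link graph.
graph = {                      # page -> pages it links to
    "A": ["P", "B"],
    "B": ["P", "A", "C"],
    "C": ["A"],
    "P": ["B"],
}
pages = list(graph)
rank = {p: 1.0 / len(pages) for p in pages}        # initialize all pages with the same rank

for _ in range(50):                                # repeat until (near) convergence
    new_rank = {p: 0.0 for p in pages}
    for q, out_links in graph.items():
        share = rank[q] / len(out_links)           # Rank(Q) / out(Q)
        for p in out_links:
            new_rank[p] += share
    rank = new_rank

print(rank)
```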
Query: IBM
Thousands of pages contain that word, but among those pages, the IBM home page has the largest rank.
Google utilizes page rank
Comments on last slide
There are a number of ways to utilize page ranks to rank pages for a given query. One way is to first retrieve pages which have reasonable similarities with the query. Then the retrieved pages are re-ranked in descending order of page rank. Another way is to compute the relevance of a page based on a function of the similarity of the page with the query and its page rank. Then pages are re-ranked in descending order of relevance.
Authority and Hub
• Query retrieves documents based on similarities
• Expand this set by adding their parents and their children
• Compute A(p) = sum H(q) for each edge (q,p)
• Compute H(p) = sum A(q) for each edge (p,q)
Authority and Hub continued
• Normalize A(p) and H(p)
• Repeat until A() and H() converge
• Output pages with top authority scores
(It has been shown that convergence is guaranteed.)
• www.teoma.com
(This company claims to have an advanced search capability which is more accurate than the standard authority and hub technique.)
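A minimal sketch of the authority/hub iteration described on the previous two slides, run on a small made-up link graph:

```python
# A sketch of the authority/hub (HITS) iteration on a made-up link graph.
from math import sqrt

edges = [("q1", "p1"), ("q1", "p2"), ("q2", "p1"), ("p1", "p2")]   # (from, to) pairs
nodes = {n for e in edges for n in e}
auth = {n: 1.0 for n in nodes}
hub = {n: 1.0 for n in nodes}

for _ in range(50):                                                # repeat until convergence
    auth = {p: sum(hub[q] for q, r in edges if r == p) for p in nodes}  # A(p) = sum of H(q) over edges (q,p)
    hub = {p: sum(auth[r] for q, r in edges if q == p) for p in nodes}  # H(p) = sum of A(q) over edges (p,q)
    a_norm = sqrt(sum(v * v for v in auth.values()))               # normalize both score vectors
    h_norm = sqrt(sum(v * v for v in hub.values()))
    auth = {p: v / a_norm for p, v in auth.items()}
    hub = {p: v / h_norm for p, v in hub.items()}

print(sorted(auth.items(), key=lambda kv: -kv[1]))                 # pages with top authority scores
```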
Various features of different search engines, including Google, AltaVista, Hotbot, etc., are described in:
Search Engines for the World Wide Web
by Alfred and Emily Glossbrenner, 3rd edition, Peachpit Press, 2001.
Metasearch engine
Connects to numerous search engines.
Given a query Q, it finds suitable search engines to process the query, invokes the selected search engines to do the search, and merges their results.
Comments on last slide
Instead of using a single search engine such as Google, a metasearch engine which connects to numerous search engines can be utilized. Upon receiving a user query, a metasearch engine sends the query (possibly with some modifications) to appropriate search engines, then merges and re-ranks the retrieved documents returned by the invoked search engines.
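A minimal sketch of the merging step is shown below. The engines, URLs and scores are hypothetical placeholders; a real metasearch engine would obtain the result lists by forwarding the query to the selected search engines:

```python
# A sketch of merging result lists from several search engines (all data made up).
def merge(result_lists):
    """Each element of result_lists is [(url, score), ...] from one search engine."""
    combined = {}
    for results in result_lists:
        if not results:
            continue
        top = max(score for _, score in results)
        for url, score in results:
            normalized = score / top if top else 0.0        # normalize scores per engine
            combined[url] = max(combined.get(url, 0.0), normalized)
    return sorted(combined.items(), key=lambda kv: -kv[1])   # re-rank merged results

engine_a = [("http://example.org/1", 0.9), ("http://example.org/2", 0.4)]   # hypothetical results
engine_b = [("http://example.org/2", 12.0), ("http://example.org/3", 7.5)]
print(merge([engine_a, engine_b]))
```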
Advantages of Metasearch Engines over Search Engines
Do not need substantial hardware relative to large search engines;
Large coverage;
Up-to-date information.
Comments to last slide
There is no need for substantial hardware, because the searches are done by the underlying search engines. The coverage of a metasearch engine is the union of the coverages of the individual search engines. The fact that it may have more up-to-date information than a large search engine is explained in the next few slides.
Up-to-date information
• Search engine crawler gets data
• Builds large index database
• Time consuming to update large index database
• Metasearch engine connects to numerous small search engines
Comments to last slide
A search engine utilizes a crawler to extract contents from Surface Web pages and then builds an index database. Upon receiving a query, the search engine searches the index database to determine the pages to return to the user. Since the contents of Web pages keep on changing, the index database needs to be updated. However, the index database is large and refreshing it may take a long time, say weeks. In contrast, if a metasearch engine is connected to numerous small search engines and each of these search engines keeps its database up-to-date, the metasearch engine may be able to provide current information.
Utilizes dictionary/ontology
Wordnet: ordinary dictionary terms
MeSH hierarchy: medical terms
May want to include synonyms and hyponyms of query terms in the query
Person --- (Synonyms: human, people)
Hyponyms: man woman
Comments to last slide
Dictionaries or ontologies may be utilized to achieve high retrieval effectiveness. A common dictionary in a general domain is Wordnet, which provides synonyms, hyponyms, and other relationships for each ordinary word. As an example, if a query contains the word “person”, its synonyms and hyponyms may be added to the query. Note that a word may have multiple senses (meanings), and the selection of suitable synonyms and hyponyms is essential. It is worthwhile to explore the use of the MeSH hierarchy for effective retrieval in the medical domain.
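As an illustration, the sketch below expands a query term with synonyms and hyponyms using Wordnet through the NLTK toolkit (NLTK is an assumption here, not mentioned in the slides; the sketch naively takes only the first noun sense):

```python
# A sketch of query expansion with WordNet via NLTK (assumes nltk and its
# 'wordnet' corpus are installed; only the first noun sense is used here).
from nltk.corpus import wordnet as wn

def expand(term, max_hyponyms=5):
    synsets = wn.synsets(term, pos=wn.NOUN)
    if not synsets:
        return {term}
    sense = synsets[0]                                   # naive: take the first sense only
    terms = {term}
    terms.update(l.name() for l in sense.lemmas())       # synonyms
    for hyp in sense.hyponyms()[:max_hyponyms]:          # a few hyponyms
        terms.update(l.name() for l in hyp.lemmas())
    return terms

print(expand("person"))
```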
Difficulty
A word sometimes has many senses
Eg Query: drugs for mental patients
Senses for drugs: prescription drugs; illegal drugs. It is useful to include antidepressant; a lot of irrelevant documents will be retrieved if heroin is included.
Comments to the last slide
The example shows that a correct addition of a hyponym (antidepressant is a hyponym of drug) will lead to high retrieval effectiveness while an incorrect addition (heroin is also a hyponym of drug) leads to poor retrieval results.
Natural Language Processing for Information Retrieval
• finds the part of speech of each word;
• identifies noun phrases;
• identifies proper names;
• recognizes acronyms: eg. CHF = congestive heart failure
• Word sense disambiguation
eg. Apple CPU
Comments to last slide
Natural language processing plays a role in information retrieval. However, so far, it is used only to identify parts of speech of words, named entities, and phrases.
Recognition of acronyms is also useful for information retrieval.
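As an illustration of these steps, the sketch below uses the NLTK toolkit (an assumption, not mentioned in the slides) to tag parts of speech and chunk simple noun phrases:

```python
# A sketch of part-of-speech tagging and simple noun-phrase chunking with NLTK
# (assumes nltk plus its tokenizer and tagger resources are installed).
import nltk

text = "Automatically Identifying Gene Terms in MEDLINE Abstracts"
tokens = nltk.word_tokenize(text)            # split into words
tagged = nltk.pos_tag(tokens)                # part-of-speech tags

grammar = "NP: {<JJ>*<NN.*>+}"               # noun phrase = adjectives followed by nouns
chunker = nltk.RegexpParser(grammar)
print(chunker.parse(tagged))                 # parse tree with NP chunks marked
```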
Word Sense Disambiguation
Pine 1 kinds of evergreen tree with needle-shaped leaves
2 waste away through sorrow or illness
Cone 1 solid body which narrows to a point
2 fruit of certain evergreen trees
Find the combination of descriptions which have the largest number of words in common.
Comments to last slide
If the query is “pine cone” and each of the two words has multiple senses, the correct sense may be identified by finding the combination of senses whose descriptions have the largest number of words in common. In this example, sense 1 of pine and sense 2 of cone have the words “evergreen tree” in common in their descriptions. These common words may be added to the query to improve retrieval effectiveness.
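A minimal sketch of this overlap idea (essentially a simplified Lesk approach), using the sense descriptions from the slide:

```python
# A sketch of choosing senses by description overlap (a simplified Lesk approach).
from itertools import product

SENSES = {
    "pine": ["kinds of evergreen tree with needle-shaped leaves",
             "waste away through sorrow or illness"],
    "cone": ["solid body which narrows to a point",
             "fruit of certain evergreen trees"],
}

def disambiguate(w1, w2):
    best, best_overlap = None, -1
    for d1, d2 in product(SENSES[w1], SENSES[w2]):
        overlap = len(set(d1.split()) & set(d2.split()))   # words shared by the two descriptions
        if overlap > best_overlap:
            best, best_overlap = (d1, d2), overlap
    return best

print(disambiguate("pine", "cone"))
# -> the evergreen-tree sense of pine and the fruit-of-evergreen-trees sense of cone
# (stemming would additionally let "tree"/"trees" match)
```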
Information extraction
Information retrieval obtains whole documents; often users want small parts of retrieved documents. Examples:
From certain papers on heart disease, extract names of authors; from experimental sections of papers, extract tables of interest.
Techniques
(1) Construct rules involving patterns or keywords to identify parts of interest; utilize a grammar to extract the required information
Eg. To identify terrorist events
useful keywords: kill, bomb etc. use a grammar to identify the subjects (terrorists) and the objects (victims)
Comments
Traditionally, information extraction is achieved by manually constructed rules, written after examining numerous instances of what is desired. In order to save labor cost, machine learning techniques have been introduced. Rules are automatically constructed from positive and negative examples, and promising rules are kept for future extraction tasks.
(2) Use machine learning techniques to construct rules
Positive and negative examples can be given to guide the construction
Aim: Reduce manual construction of rules
Machine learning
Example: Pavilion a230 Minitower
AMD ® Athlon XP .. GHz
…….
Pavilion a210n Minitower
Intel ® Celeron … … GHz
Rule: (var1) * ‘®’ ( var2) ‘GHz’
Comments
In this example, the user supplies a few positive instances to be extracted. Then, the system automatically constructs the rule with ‘®’ and ‘GHz’ as landmarks. The words before the landmarks are captured by variables.
In the Web environment, HTML or XML documents have tags and they may be used to construct rules. However, rules involving tags may be site dependent, implying that new rules may need to be generated when there is a site change.
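As an illustration, the rule above can be approximated by a regular expression in which ‘®’ and ‘GHz’ act as landmarks; the sample text and clock speeds below are made up:

```python
# A sketch of the landmark rule as a regular expression (sample data made up).
import re

text = """Pavilion a230 Minitower AMD ® Athlon XP 2.0 GHz
Pavilion a210n Minitower Intel ® Celeron 2.3 GHz"""

rule = re.compile(r"(.*?)\s*®\s*(.*?)\s*GHz")   # var1 before '®', var2 between '®' and 'GHz'
for var1, var2 in rule.findall(text):
    print(var1, "|", var2)
```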
Rules involving tags
<b> Martino Motor Sales </b>
<b> Currie Motors Lincoln Memory </b>
Rule: * <b> ‘Var’ </b>
Extracted data may not be that structured
Layout of document can be site dependent, implying that new correct rules need to be constructed for new sites
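Similarly, the tag-based rule above can be approximated with a regular expression (the sketch below is only an illustration, not the actual extraction system):

```python
# A sketch of the <b> ... </b> rule as a regular expression.
import re

html = "<b> Martino Motor Sales </b> ... <b> Currie Motors Lincoln Memory </b>"
rule = re.compile(r"<b>\s*(.*?)\s*</b>")
print(rule.findall(html))   # ['Martino Motor Sales', 'Currie Motors Lincoln Memory']
```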
Summary
• Information retrieval:
user’s point of view:
eg. phrase, case sensitive
system point of view:
eg. Feedback query construction
Web retrieval vs non-web retrieval
search engine vs metasearch engine
Summary continued
• Natural language processing Eg. acronym recognition
• Information extraction rules: manual, machine learning Can be site dependent