![Page 1: Web Search – Summer Term 2006 II. Information Retrieval (Basics Cont.)](https://reader034.vdocuments.net/reader034/viewer/2022052702/568153c6550346895dc1bfe6/html5/thumbnails/1.jpg)
Web Search – Summer Term 2006
II. Information Retrieval (Basics Cont.)
(c) Wolfgang Hürst, Albert-Ludwigs-University
Organizational Remarks
Exercises: Please register for the exercises by sending me ([email protected]) an email by Friday, May 5th, with
- Your name
- Matriculation number (Matrikelnummer)
- Degree program (Studiengang)
- Plans for the exam
This is only used to organize the exercises and has no consequences if you decide to drop the course later.
Recap: IR System & Tasks Involved
[Figure: IR system overview, built up over three slides. Labeled components: information need, user interface, query, query processing (parsing & term processing), logical view of the information need, documents, select data for indexing, parsing & term processing, index, searching, ranking, results, result representation, performance evaluation.]
Query Languages: Boolean Search
So far:
a) Single terms (unrelated / bag of words)
b) Boolean operators (AND, OR, NOT)
Boolean search: the main search model before the Web came along (note: used mainly by professional users).
Advantages of Boolean queries:
- Precise (mathematical model)
- Offers great control and transparency
- Good for domains where results are ranked by something other than relevance, e.g. chronologically
Boolean Search (Cont.)
Disadvantages of Boolean queries:
- Sometimes hard to specify, even for experts
- Binary decision (relevant or not)
- Bag of words, no position information
Example:
Query: New York City
- Doc. 1: This is a nice city.
- Doc. 2: This city has a new library.
Query: New AND York AND City
- Doc. 1: New York has a new library.
- Doc. 2: The city of York has a new library.
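The second example can be checked with a minimal sketch of Boolean AND retrieval over a bag-of-words index. The tokenizer and the function names are illustrative assumptions, not part of the lecture.

```python
import re

def tokenize(text):
    # Lowercase and split into word tokens; positions are discarded (bag of words).
    return re.findall(r"\w+", text.lower())

def build_index(docs):
    # Inverted index: term -> set of ids of the documents containing it.
    index = {}
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index.setdefault(term, set()).add(doc_id)
    return index

def boolean_and(index, terms):
    # A document matches only if it contains every query term.
    postings = [index.get(term.lower(), set()) for term in terms]
    return set.intersection(*postings) if postings else set()

docs = {
    1: "New York has a new library.",
    2: "The city of York has a new library.",
}
index = build_index(docs)
print(boolean_and(index, ["New", "York", "City"]))
```

Under these assumptions only Doc. 2 is returned: it contains all three words although it is about the city of York, while Doc. 1 is rejected because it lacks the word "city".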
Further Query Types
- Phrases, e.g. New York City
- Proximity, e.g. University NEAR Freiburg (finds University of Freiburg and Albert-Ludwigs University Freiburg)
- Structural queries, e.g. AUTHOR = Ottmann AND TEXT CONTAINS binary search tree
- Natural language vs. keywords
- Pattern matching, e.g. wildcards: index* (finds index, indexing, indexes, indexer, …)
- Spelling corrections
- and some more (often application dependent)
Phrases
Often used (esp. for web search): quotes, e.g. “New York City”
Advantage: easy to use and seems to work well (about 10% of web queries are such phrases according to Manning et al. [2])
How do we support this?
- We need word positions.
- We need all original words (e.g. no stop word removal in University of Freiburg).
- We need an efficient way to do this.
Approaches to Support Phrases
Biword indexes:
- Idea: store pairs of consecutive words (in addition to single terms), e.g. New York City is represented by the terms New, York, City, New York, York City
- Might cause problems for phrases with more than 2 words, but often works quite well
Positional indexes:
- Idea: store the position of each word in the postings list
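The biword idea can be sketched in a few lines. The function name `index_terms` and the whitespace tokenization are assumptions made for illustration.

```python
def index_terms(text):
    # Single terms plus all pairs of consecutive words (biwords).
    words = text.lower().split()
    biwords = [f"{first} {second}" for first, second in zip(words, words[1:])]
    return words + biwords

print(index_terms("New York City"))
```

A three-word phrase query would then be answered as "new york" AND "york city". A document can contain both biwords at unrelated places, which is where the problems with longer phrases come from.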
Positional Indexes – Example
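[Figure: dictionary entries with counts (NEW 23535, YORK 9421, CITY 18453), each pointing to a postings list of document IDs (e.g. 23, 25, 32, 47, …). In the positional index every posting has the form docID:count[positions], e.g. 23:4[3,12,46,78], 25:3[43,120,221], 32:6[12,20,57,200,322,481], …]
NEW 23535 …, 25:6[41,87,136,…], …
YORK 9421 …, 25:2[42,137], …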
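A minimal sketch of how a two-word phrase query can be answered from such postings. It uses only the positions visible on the slide for document 25; the positions hidden behind "…" are left out, and the data layout (term -> docID -> positions) is an assumption.

```python
# term -> {doc_id: sorted word positions}; only the positions shown on the slide.
positional_index = {
    "new": {25: [41, 87, 136]},
    "york": {25: [42, 137]},
}

def phrase_match(index, first, second):
    # Return doc_id -> start positions where `second` directly follows `first`.
    hits = {}
    common_docs = index[first].keys() & index[second].keys()
    for doc_id in common_docs:
        second_positions = set(index[second][doc_id])
        starts = [p for p in index[first][doc_id] if p + 1 in second_positions]
        if starts:
            hits[doc_id] = starts
    return hits

print(phrase_match(positional_index, "new", "york"))
```

With these postings the phrase "New York" starts at positions 41 and 136 of document 25, while the occurrence of NEW at position 87 is not followed by YORK.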
Positional Indexes
Also works for queries such as
- University [word]1 Freiburg
- University NEAR Freiburg
Problem: size
- Need to store additional info (positions) on an already large index (stop words!)
- Approx. size: 2-4 times the original index, 1/2 the size of the uncompressed documents [2]
In practice: combinations exist, e.g. index with names as phrases, useful biwords, and stored positions
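A proximity query can be sketched on top of word positions as follows. The window size `k`, the function names, and the third example document are assumptions for illustration; the lecture does not define NEAR precisely.

```python
def positions(text):
    # term -> list of word positions within one document.
    index = {}
    for pos, word in enumerate(text.lower().split()):
        index.setdefault(word, []).append(pos)
    return index

def near(doc_index, term_a, term_b, k):
    # True if some occurrence of term_a lies within k words of term_b.
    return any(
        abs(pa - pb) <= k
        for pa in doc_index.get(term_a, [])
        for pb in doc_index.get(term_b, [])
    )

docs = [
    "University of Freiburg",
    "Albert-Ludwigs University Freiburg",
    "The university moved its archive from Basel to Freiburg",  # hypothetical
]
for text in docs:
    print(text, "->", near(positions(text), "university", "freiburg", k=2))
```

Note that "of" has to stay in the index for the distances to be meaningful, which is why stop words add to the size problem.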
Pattern Matching – Wildcards
Example: fußball* is mapped to fußballer, fußballspiel, fußballweltmeister, …
Trailing wildcard queries, e.g. fußball*
- Can easily be found if the dictionary is stored as a B-tree
Leading wildcard queries, e.g. *meister
- Can easily be found if the dictionary is stored as a reverse B-tree (i.e. terms stored backwards)
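The reason a B-tree helps is that all terms sharing a prefix are adjacent in sorted order. The sketch below uses a sorted Python list with binary search as a stand-in for the B-tree (a deliberate simplification); the small dictionary and the function names are assumptions.

```python
from bisect import bisect_left

def trailing_wildcard(sorted_terms, prefix):
    # All terms starting with `prefix` form one contiguous range in sorted order.
    lo = bisect_left(sorted_terms, prefix)
    hi = bisect_left(sorted_terms, prefix + "\uffff")  # sentinel just above the prefix range
    return sorted_terms[lo:hi]

def leading_wildcard(sorted_reversed_terms, suffix):
    # Same search on the reversed terms, with the suffix reversed into a prefix.
    matches = trailing_wildcard(sorted_reversed_terms, suffix[::-1])
    return [term[::-1] for term in matches]

dictionary = ["federball", "fußball", "fußballer", "fußballspiel", "meister", "weltmeister"]
forward = sorted(dictionary)
backward = sorted(term[::-1] for term in dictionary)

print(trailing_wildcard(forward, "fußball"))   # query fußball*
print(leading_wildcard(backward, "meister"))   # query *meister
```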
Wildcards (Cont.)
General wildcards, e.g. f*ball (matches e.g. fußball, federball, …)
Idea: move the * to the end
Permuterm index:
- For each word (e.g. fußball), add an end symbol (e.g. fußball$) and create all rotations (e.g. fußball$, ußball$f, ßball$fu, ball$fuß, …, l$fußbal, $fußball)
- Dictionary = all permuterms (rotations), postings = dictionary terms containing this rotation
Query: rotate the * to the end (e.g. ball$f*) and get the postings from the permuterm index (e.g. ball$fuß, ball$feder, …)
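The permuterm lookup can be sketched as follows. For brevity the rotations are kept in a plain dict and matched by a linear scan, where a real dictionary would use a prefix search as for trailing wildcards; the function names and the extra terms are assumptions.

```python
def rotations(term):
    # All rotations of term + "$", e.g. fußball$ -> ußball$f -> ßball$fu -> ...
    marked = term + "$"
    return [marked[i:] + marked[:i] for i in range(len(marked))]

def build_permuterm(terms):
    # rotation -> original dictionary term
    index = {}
    for term in terms:
        for rotation in rotations(term):
            index[rotation] = term
    return index

def wildcard_lookup(index, query):
    # Handles queries with exactly one "*", e.g. f*ball -> prefix "ball$f".
    head, tail = query.split("*")
    prefix = tail + "$" + head
    return sorted({term for rotation, term in index.items() if rotation.startswith(prefix)})

permuterm = build_permuterm(["fußball", "federball", "fußballer", "handball"])
print(wildcard_lookup(permuterm, "f*ball"))
```

The terms fußballer and handball are not returned for f*ball, because none of their rotations starts with ball$f.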
Structural Queries
In practice: often semi-structured documents
Structural queries: use the available structure to better specify the information need, e.g. AUTHOR = Ottmann AND TEXT CONTAINS search tree
Requires storing structure information, e.g. in a parametric index, encoded in the dictionary:
OTTMANN.TITLE: 12 26 44 … 48
OTTMANN.BODY: 8 9 17 … 23
OTTMANN.AUTHOR: 9 17 19 … 28
or in the postings:
OTTMANN: 8.BODY, 9.AUTHOR, 9.BODY, 12.TITLE, …
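The two encodings carry the same information, which a small sketch can make concrete. Only the document IDs visible on the slide are used (the ones hidden behind "…" are omitted), and the data layout and function names are assumptions.

```python
# Field encoded in the dictionary: (term, field) -> document ids shown on the slide.
field_index = {
    ("ottmann", "TITLE"): [12, 26, 44, 48],
    ("ottmann", "BODY"): [8, 9, 17, 23],
    ("ottmann", "AUTHOR"): [9, 17, 19, 28],
}

def to_postings_encoding(index):
    # Field encoded in the postings: term -> sorted list of (doc_id, field).
    merged = {}
    for (term, field), doc_ids in index.items():
        for doc_id in doc_ids:
            merged.setdefault(term, []).append((doc_id, field))
    return {term: sorted(postings) for term, postings in merged.items()}

def docs_with(index, term, field):
    # Documents that contain `term` in the given field.
    return set(index.get((term, field), []))

print(to_postings_encoding(field_index)["ottmann"][:4])
print(docs_with(field_index, "ottmann", "AUTHOR") & docs_with(field_index, "ottmann", "BODY"))
```

The first four merged postings reproduce the sequence 8.BODY, 9.AUTHOR, 9.BODY, 12.TITLE from the slide. A structural query such as AUTHOR = Ottmann AND BODY CONTAINS Ottmann reduces to an intersection of two field-specific postings lists.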
Summary: Further Query Types
- Phrases, e.g. New York City
- Proximity, e.g. University NEAR Freiburg (finds University of Freiburg and Albert-Ludwigs University Freiburg)
- Structural queries, e.g. AUTHOR = Ottmann AND TEXT CONTAINS binary search tree
- Natural language vs. keywords
- Pattern matching, e.g. wildcards: index* (finds index, indexing, indexes, indexer, …)
- Spelling corrections
- and some more (often application dependent)
Ranking – Motivation
So far: mapping of processed words from the query to processed words from the documents
Result: a set of (hopefully) relevant documents
Similar to Boolean search, either
- explicitly specified by the user (q1 AND q2), or
- implicitly done by the system, e.g. by returning docs with all query terms (AND) or by returning docs with any query term (OR)
Intuitively: a doc. containing more different query terms than another one seems more relevant.
Estimating Relevance
Question: How can we estimate relevance based on a given query and a document collection?
- Different terms might have a different influence on relevance, e.g. stop words are less relevant than names
- Documents containing more (different) query terms might be more relevant, e.g. New York (state and city) vs. New York City
- Documents containing an important term more often might be more relevant, e.g. one query term: doc. 1 contains the query term 200 times, doc. 2 contains it just 5 times
Example for Term Weighting (source: Frakes et al. [3], page 365)
Vocabulary: FACTORS, INFORMATION, HELP, HUMAN, OPERATION, RETRIEVAL, SYSTEMS (vector = (1 1 1 1 1 1 1))
Query = {HUMAN FACTORS IN INFORMATION RETRIEVAL SYSTEMS}, vector representation = (1 1 0 1 0 1 1)
Document 1: {HUMAN, FACTORS, INFORMATION, RETRIEVAL}, vector representation = (1 1 0 1 0 1 0)
Document 2: {HUMAN, FACTORS, HELP, SYSTEMS}, vector representation = (1 0 1 1 0 0 1)
Document 3: {FACTORS, OPERATION, SYSTEMS}, vector representation = (1 0 0 0 1 0 1)
Simple match (component-wise product of query and document vector, summed up):
- Query (1 1 0 1 0 1 1), Doc 1 (1 1 0 1 0 1 0): (1 1 0 1 0 1 0) = 4
- Query (1 1 0 1 0 1 1), Doc 2 (1 0 1 1 0 0 1): (1 0 0 1 0 0 1) = 3
- Query (1 1 0 1 0 1 1), Doc 3 (1 0 0 0 1 0 1): (1 0 0 0 0 0 1) = 2
Weighted match (document vectors hold term weights instead of 0/1):
- Query (1 1 0 1 0 1 1), Doc 1 (2 3 0 5 0 3 0): (2 3 0 5 0 3 0) = 13
- Query (1 1 0 1 0 1 1), Doc 2 (2 0 4 5 0 0 1): (2 0 0 5 0 0 1) = 8
- Query (1 1 0 1 0 1 1), Doc 3 (2 0 0 0 2 0 1): (2 0 0 0 0 0 1) = 3
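Both match scores are plain dot products, which the following sketch recomputes. The binary vectors are derived from the term sets listed above and the weighted vectors are copied from the example; the helper names are assumptions.

```python
VOCABULARY = ["FACTORS", "INFORMATION", "HELP", "HUMAN", "OPERATION", "RETRIEVAL", "SYSTEMS"]

def to_vector(terms):
    # Binary vector over the vocabulary: 1 if the term occurs, else 0.
    return [1 if term in terms else 0 for term in VOCABULARY]

def match(query_vector, doc_vector):
    # Dot product of query and document vector.
    return sum(q * d for q, d in zip(query_vector, doc_vector))

query = to_vector({"HUMAN", "FACTORS", "INFORMATION", "RETRIEVAL", "SYSTEMS"})
binary_docs = {
    1: to_vector({"HUMAN", "FACTORS", "INFORMATION", "RETRIEVAL"}),
    2: to_vector({"HUMAN", "FACTORS", "HELP", "SYSTEMS"}),
    3: to_vector({"FACTORS", "OPERATION", "SYSTEMS"}),
}
weighted_docs = {
    1: [2, 3, 0, 5, 0, 3, 0],
    2: [2, 0, 4, 5, 0, 0, 1],
    3: [2, 0, 0, 0, 2, 0, 1],
}

simple_scores = {doc: match(query, vec) for doc, vec in binary_docs.items()}
weighted_scores = {doc: match(query, vec) for doc, vec in weighted_docs.items()}
print(simple_scores, weighted_scores)
```

The ranking of the three documents is the same in both cases, but the weighted scores separate them more clearly.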
Term Frequency (TF)
In practice: Various experiments have confirmed that term frequency (TF) is a significant measure of relevance.
But: it depends on the document's length. Therefore: normalization
#T = frequency of term T in doc. D
DL = document length = no. of terms in D
[Figure: number of appearances (y-axis) per term, terms sorted by number of appearances (x-axis)]
Inverse Document Frequency (IDF)
Observation: the relevance of a term also depends on its frequency in the whole collection.
Example: Query = Amazon Rain Forest, searched in a newspaper archive vs. in Amazon.com press releases
Inverse Document Frequency (IDF):
The TF*IDF Measure
TF (T, D) = # appearances in one document
- Estimation of how well a term represents the content of one document (intra-document frequency)
IDF (T) = inverse of # appearances in the collection
- Estimation of how well a term separates different documents (inverse of inter-document frequency)
Combined measure / weight:
TF*IDF (T, D) = TF (T, D) * IDF (T)
(#T, DL, N as defined before)
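A minimal sketch of the combined weight, under stated assumptions: TF is taken as #T / DL (length-normalized), N is taken to be the number of documents in the collection, and IDF is computed as log(N / number of documents containing T). This is one common variant and not necessarily the exact definition used in the lecture; the tiny collection is hypothetical.

```python
import math
from collections import Counter

def tf(term, doc_tokens):
    # Assumed normalization: #T / DL.
    return Counter(doc_tokens)[term] / len(doc_tokens)

def idf(term, collection):
    # Assumed variant: log(N / number of documents containing the term).
    containing = sum(1 for doc in collection if term in doc)
    return math.log(len(collection) / containing) if containing else 0.0

def tf_idf(term, doc_tokens, collection):
    return tf(term, doc_tokens) * idf(term, collection)

# Hypothetical collection in the spirit of the Amazon.com press release example.
collection = [
    ["amazon", "rain", "forest", "amazon"],
    ["amazon", "press", "release"],
    ["amazon", "books"],
]
print(tf_idf("amazon", collection[0], collection))  # term occurs in every document
print(tf_idf("forest", collection[0], collection))  # term occurs in one document only
```

In this collection "amazon" gets weight 0 because it appears in every document and therefore does not separate them, whereas "forest" keeps a positive weight.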
TF*IDF Weighting – Comments
Note: Different definitions / versions exist
Based on the application and data, other weights might be used, e.g.
- Structure information (e.g. term in title, abstract, …)
- Popularity (e.g. Titanic in a movie database)
- Relative position between terms (e.g. Amazon Rain Forest vs. Amazon Press Releases)
- Date (e.g. news archive: newer = more relevant)
- Layout (e.g. bold-faced font)
- etc.
However, TF*IDF often has a high impact
Two of the Most Important Weighting Functions
Source: Amit Singhal, Modern Information Retrieval: A Brief Overview, IEEE Data Engineering Bulletin, 2001
Okapi weighting based document score:
Pivoted normalization weighting based doc. score:
with
- tf = the term's frequency in the document
- qtf = the term's frequency in the query
- N = the total number of documents in the collection
- df = the number of documents that contain the term
- dl = the document length (in bytes)
- avdl = the average document length
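The two formulas themselves are not preserved in this transcript. The sketch below follows the form in which these scores are commonly stated for Okapi and pivoted normalization weighting and should be checked against the cited overview. The tuning constants `k1`, `b`, `k3` and `s` and their default values are assumptions, as are the function names; both sums run over terms that occur in the query and in the document.

```python
import math

def okapi_score(query_tf, doc_tf, df, n_docs, dl, avdl, k1=1.2, b=0.75, k3=8.0):
    # query_tf / doc_tf: term -> frequency; df: term -> document frequency.
    score = 0.0
    for term, qtf in query_tf.items():
        tf = doc_tf.get(term, 0)
        if tf == 0:
            continue  # only terms present in query and document contribute
        idf_part = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5))
        tf_part = ((k1 + 1) * tf) / (k1 * ((1 - b) + b * dl / avdl) + tf)
        qtf_part = ((k3 + 1) * qtf) / (k3 + qtf)
        score += idf_part * tf_part * qtf_part
    return score

def pivoted_score(query_tf, doc_tf, df, n_docs, dl, avdl, s=0.2):
    score = 0.0
    for term, qtf in query_tf.items():
        tf = doc_tf.get(term, 0)
        if tf == 0:
            continue
        tf_part = (1 + math.log(1 + math.log(tf))) / ((1 - s) + s * dl / avdl)
        score += tf_part * qtf * math.log((n_docs + 1) / df[term])
    return score

# Hypothetical numbers: one query term, document of exactly average length.
print(okapi_score({"x": 1}, {"x": 1}, {"x": 1}, n_docs=10, dl=100, avdl=100))
print(pivoted_score({"x": 1}, {"x": 1}, {"x": 1}, n_docs=10, dl=100, avdl=100))
```

Both scores combine the same ingredients as TF*IDF: a dampened term frequency, a document length normalization relative to avdl, and an inverse document frequency factor.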
Evaluation of IR Systems
Standard approaches for algorithm and computer system evaluation:
- Speed / processing time
- Storage requirements
- Correctness of the algorithms used and of their implementation
But most importantly:
- Performance, effectiveness
Another important issue:
- Usability, users' perception
Questions: What is a good / better search engine? How to measure search engine quality? How to perform evaluations? Etc.
What does Performance/Effectiveness of IR Systems mean?
Typical questions:
- How good is the quality of a system?
- Which system should I buy? Which one is better?
- How can I measure the quality of a system?
- What does quality mean for me?
- Etc.
The answers depend on users, application, …
Very different views and perceptions: user vs. search engine provider, developer vs. manager, seller vs. buyer, …
And remember: queries can be ambiguous, unspecific, etc.
Hence, in practice, use restrictions and idealization, e.g. only binary decisions
Precision & Recall
$$\text{Precision} = \frac{\#\,(\text{found and relevant})}{\#\,\text{found}}$$
$$\text{Recall} = \frac{\#\,(\text{found and relevant})}{\#\,\text{relevant}}$$
[Figure: document collection A to J and the result list: 1. Doc. B, 2. Doc. E, 3. Doc. F, 4. Doc. G, 5. Doc. D, 6. Doc. H]
Restrictions:
- 0/1 relevance
- Set instead of order/ranking
But: we can use this for the evaluation of ranking, too (via top N docs.)
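A minimal sketch of both measures on the result list from the slide. Which documents are relevant is not recoverable from the transcript, so the set `relevant` below is a hypothetical choice made only to have numbers to compute.

```python
def precision_recall(found, relevant):
    # Set-based precision and recall; assumes both sets are non-empty.
    found, relevant = set(found), set(relevant)
    hits = len(found & relevant)
    return hits / len(found), hits / len(relevant)

ranking = ["B", "E", "F", "G", "D", "H"]   # result list from the slide
relevant = {"B", "D", "F", "I"}            # hypothetical relevance judgments

print(precision_recall(ranking, relevant))       # whole result set
print(precision_recall(ranking[:3], relevant))   # top N = 3 docs of the ranking
```

Evaluating only the top N entries is how the set-based measures are applied to a ranking: a system that places relevant documents early gets a higher precision at small N.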
Calculating Precision & Recall
Precision: can be calculated directly from the result
Recall: requires relevance ratings for the whole (!) data collection
In practice: approaches to estimate recall
1.) Use a representative sample instead of the whole data collection
2.) Document-source method
3.) Expanding queries
4.) Compare result with external sources
5.) Pooling method
Precision & Recall – Special cases
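Special treatment is necessary if no doc. is found or no relevant docs. exist (division by zero).
The conditions below imply the following reading of the sets: A = found and relevant, B = found but not relevant, C = relevant but not found, D = all remaining documents.
No relevant doc. exists: A = C = 0
- 1st case: B = 0
- 2nd case: B > 0
Empty result set: A = B = 0
- 1st case: C = 0
- 2nd case: C > 0
[Figure: set diagrams of A, B, C and D for each of the cases]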
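The special cases can be made explicit in code. The sketch reads A as found and relevant, B as found but not relevant, and C as relevant but not found, which is consistent with the conditions on the slide. The values the lecture assigns in the undefined cases are not preserved, so the sketch returns `None` there instead of choosing a convention.

```python
def precision_recall_from_counts(a, b, c):
    # a: found & relevant, b: found & not relevant, c: relevant & not found.
    precision = a / (a + b) if (a + b) > 0 else None  # undefined for an empty result set
    recall = a / (a + c) if (a + c) > 0 else None     # undefined if no relevant doc. exists
    return precision, recall

# No relevant doc. exists (A = C = 0)
print(precision_recall_from_counts(0, 0, 0))  # 1st case: B = 0
print(precision_recall_from_counts(0, 4, 0))  # 2nd case: B > 0
# Empty result set (A = B = 0)
print(precision_recall_from_counts(0, 0, 3))  # 2nd case: C > 0
```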
Precision & Recall Graphs
Comparing 2 systems:
- Prec 1 = 0.6, Rec 1 = 0.3
- Prec 2 = 0.4, Rec 2 = 0.6
Which one is better?
Prec.-Recall graph:
[Figure: precision (y-axis) plotted against recall (x-axis)]
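A single precision/recall pair per system does not settle the comparison, which is why the values are plotted over the whole ranking. The sketch below computes the points of such a graph by cutting a ranking after each position; the ranking and the relevance judgments are hypothetical, and no plotting library is used.

```python
def pr_points(ranking, relevant):
    # One (recall, precision) point per cut-off position n = 1, 2, ...
    points, hits = [], 0
    for n, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
        points.append((hits / len(relevant), hits / n))
    return points

ranking = ["B", "E", "F", "G", "D", "H"]   # hypothetical ranked result
relevant = {"B", "D", "F", "I"}            # hypothetical relevance judgments

for recall, precision in pr_points(ranking, relevant):
    print(f"recall={recall:.2f} precision={precision:.2f}")
```

Plotting these points for two systems shows at which recall levels one system is more precise than the other.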
References & Recommended Reading
[1] R. Baeza-Yates, B. Ribeiro-Neto: Modern Information Retrieval, Addison Wesley, 1999. Chapter 4 (Query Languages)
[2] C. Manning, P. Raghavan, H. Schütze: Introduction to Information Retrieval (to appear 2007). Chapters 1.4, 2.2.2, 4.1, 6.1 (Query Lang.), Chapter 6.2 (Ranking / Relevance). Draft available online at http://www-csli.stanford.edu/~schuetze/information-retrieval-book.html
[3] William B. Frakes, Ricardo Baeza-Yates (eds.): Information Retrieval – Data Structures and Algorithms, P T R Prentice Hall, 1992. Chapter 14: Ranking Algorithms
[4] G. Salton: A Blueprint for Automatic Indexing, ACM SIGIR Forum, Vol. 16, Issue 2, Fall 1981 (Term Processing, Ranking / Relevance)
(References for evaluation: next time)
Schedule
- Introduction
- IR-Basics (Lectures)
  - Overview, terms and definitions
  - Index (inverted files)
  - Term processing
  - Query processing
  - Ranking (TF*IDF, …)
  - Evaluation
  - IR-Models (Boolean, vector, probab.)
- IR-Basics (Exercises)
- Web Search (Lectures and exercises)