8/28/97 Information Organization and Retrieval
IR Implementation Issues, Web Crawlers and Web Search Engines
University of California, Berkeley
School of Information Management and Systems
SIMS 202: Information Organization and Retrieval
Review
• Boolean Retrieval
• Ranked Retrieval
• Vector Space Model
Boolean Model
[Figure: Venn diagram of terms t1, t2, t3 over documents D1-D11, divided into eight regions m1-m8. Each region mi is a conjunction of t1, t2 and t3 in which each term is either asserted or negated.]
Boolean Searching
“Measurement of the width of cracks in prestressed concrete beams”
Formal Query: cracks AND beams AND Width_measurement AND Prestressed_concrete
[Figure: Venn diagram of the four concepts Cracks, Beams, Width measurement, Prestressed concrete]
Relaxed Query: (C AND B AND P) OR (C AND B AND W) OR (C AND W AND P) OR (B AND W AND P)
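The relaxed query accepts any document that matches at least three of the four concepts. A minimal sketch of that idea, assuming each document has already been reduced to a set of concept labels (the documents A, B, C and the function names are made up for illustration):

```python
from itertools import combinations

CONCEPTS = {"cracks", "beams", "width_measurement", "prestressed_concrete"}

def formal_match(doc_terms):
    """Formal query: all four concepts must be present (pure AND)."""
    return CONCEPTS <= doc_terms

def relaxed_match(doc_terms, k=3):
    """Relaxed query: OR over every AND-combination of k concepts."""
    return any(set(combo) <= doc_terms
               for combo in combinations(sorted(CONCEPTS), k))

# Hypothetical documents, not taken from the slides.
docs = {
    "A": {"cracks", "beams", "width_measurement", "prestressed_concrete"},
    "B": {"cracks", "beams", "prestressed_concrete"},
    "C": {"cracks", "beams"},
}

formal_hits = [name for name, terms in docs.items() if formal_match(terms)]
relaxed_hits = [name for name, terms in docs.items() if relaxed_match(terms)]
print(formal_hits, relaxed_hits)
```

Only document A satisfies the formal query, while the relaxed form also admits B, which lacks one concept.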
Boolean Problems
• Disjunctive (OR) queries lead to information overload
• Conjunctive (AND) queries lead to reduced, and commonly zero, results
• Conjunctive queries imply reduction in Recall
Advantages and Disadvantages of the Boolean Model
Advantages
• Complete expressiveness for any identifiable subset of collection
• Exact and simple to program
• The whole panoply of Boolean Algebra available
Disadvantages
• Complex query syntax is often misunderstood (if understood at all)
• Problems of Null output and Information Overload
• Output is not ordered in any useful fashion
Boolean Extensions
• Fuzzy Logic
– Adds weights to each term/concept
– ta AND tb is interpreted as MIN(w(ta),w(tb))
– ta OR tb is interpreted as MAX(w(ta),w(tb))
• Proximity/Adjacency operators
– Interpreted as additional constraints on Boolean AND
• TOPIC system
– Uses various weighted forms of Boolean logic and proximity information in calculating RSVs
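The fuzzy-logic reading of AND and OR can be sketched in a few lines. The term weights below are hypothetical values for a single document, chosen only to show how MIN and MAX combine:

```python
def fuzzy_and(*weights):
    """ta AND tb is interpreted as MIN(w(ta), w(tb))."""
    return min(weights)

def fuzzy_or(*weights):
    """ta OR tb is interpreted as MAX(w(ta), w(tb))."""
    return max(weights)

# Hypothetical term weights for one document.
w = {"ta": 0.7, "tb": 0.2, "tc": 0.5}

score_and = fuzzy_and(w["ta"], w["tb"])                      # ta AND tb
score_or = fuzzy_or(w["ta"], w["tb"])                        # ta OR tb
score_mixed = fuzzy_or(fuzzy_and(w["ta"], w["tb"]), w["tc"])  # (ta AND tb) OR tc
print(score_and, score_or, score_mixed)
```

Unlike strict Boolean matching, every document receives a graded score, so the output can be ordered.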
Vector Space Model
• Documents are represented as vectors in term space
– Terms are usually stems
– Documents represented by binary vectors of terms
• Queries represented the same as documents
• Query and Document weights are based on length and direction of their vector
• A vector distance measure between the query and documents is used to rank retrieved documents
Documents in Vector Space
[Figure: documents D1-D11 plotted as points in a three-dimensional term space with axes t1, t2, t3]
Vector Space Documents and Queries
docs  t1  t2  t3  RSV=Q.Di
D1    1   0   1   4
D2    1   0   0   1
D3    0   1   1   5
D4    1   0   0   1
D5    1   1   1   6
D6    1   1   0   3
D7    0   1   0   2
D8    0   1   0   2
D9    0   0   1   3
D10   0   1   1   5
D11   1   0   1   3
Q     1   2   3
      q1  q2  q3
[Figure: documents D1-D11 plotted in the term space with axes t1, t2, t3]
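The RSV column is the dot product of the query vector with each document vector. A minimal sketch using the vectors from the table (the function and variable names are illustrative). Note that for D11 = (1, 0, 1) the dot product comes to 4 rather than the 3 listed on the slide, so either the vector or the RSV shown there appears to be a typo:

```python
# Binary document vectors over (t1, t2, t3), copied from the table.
docs = {
    "D1": (1, 0, 1), "D2": (1, 0, 0), "D3": (0, 1, 1), "D4": (1, 0, 0),
    "D5": (1, 1, 1), "D6": (1, 1, 0), "D7": (0, 1, 0), "D8": (0, 1, 0),
    "D9": (0, 0, 1), "D10": (0, 1, 1), "D11": (1, 0, 1),
}
query = (1, 2, 3)  # weights q1, q2, q3

def rsv(q, d):
    """Retrieval status value as the dot product Q . Di."""
    return sum(qi * di for qi, di in zip(q, d))

scores = {name: rsv(query, vec) for name, vec in docs.items()}

# Rank by decreasing RSV; sorted() is stable, so ties keep table order.
ranking = sorted(scores, key=lambda name: -scores[name])
for name in ranking:
    print(name, scores[name])
```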
Similarity Measures
Simple matching (coordination level match): $|Q \cap D|$
Dice’s Coefficient: $\frac{2\,|Q \cap D|}{|Q| + |D|}$
Jaccard’s Coefficient: $\frac{|Q \cap D|}{|Q \cup D|}$
Cosine Coefficient: $\frac{|Q \cap D|}{|Q|^{1/2} \times |D|^{1/2}}$
Overlap Coefficient: $\frac{|Q \cap D|}{\min(|Q|, |D|)}$
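For binary representations, each of the five measures named above reduces to counting shared and total terms. A small sketch using the standard set-based definitions, with Q and D as Python sets of terms (the example sets are hypothetical):

```python
import math

def simple_match(q, d):
    """Coordination level match: number of shared terms."""
    return len(q & d)

def dice(q, d):
    return 2 * len(q & d) / (len(q) + len(d))

def jaccard(q, d):
    return len(q & d) / len(q | d)

def cosine(q, d):
    return len(q & d) / math.sqrt(len(q) * len(d))

def overlap(q, d):
    return len(q & d) / min(len(q), len(d))

# Hypothetical query and document term sets.
Q = {"t1", "t2", "t3"}
D = {"t2", "t3", "t4", "t5"}

for measure in (simple_match, dice, jaccard, cosine, overlap):
    print(measure.__name__, round(measure(Q, D), 3))
```

All five share the same numerator; they differ only in how they normalize for the sizes of the query and the document.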
Vector Space with Term Weights and Cosine Matching
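[Figure: plot with axes Term A and Term B (each from 0 to 1.0) showing the vectors Q, D1 and D2, with the angles between Q and the two documents marked 1 and 2]
$D_i = (d_{i1}, w_{d_{i1}};\ d_{i2}, w_{d_{i2}};\ \ldots;\ d_{it}, w_{d_{it}})$
$Q = (q_{i1}, w_{q_{i1}};\ q_{i2}, w_{q_{i2}};\ \ldots;\ q_{it}, w_{q_{it}})$
$sim(Q, D_i) = \frac{\sum_{j=1}^{t} w_{q_j} w_{d_{ij}}}{\sqrt{\sum_{j=1}^{t} (w_{q_j})^2 \sum_{j=1}^{t} (w_{d_{ij}})^2}}$
Q = (0.4, 0.8)
D1 = (0.8, 0.3)
D2 = (0.2, 0.7)
$sim(Q, D_2) = \frac{(0.4 \times 0.2) + (0.8 \times 0.7)}{\sqrt{[(0.4)^2 + (0.8)^2] \times [(0.2)^2 + (0.7)^2]}} = \frac{0.64}{\sqrt{0.42}} = 0.98$
$sim(Q, D_1) = \frac{0.56}{\sqrt{0.58}} = 0.74$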
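The cosine computation for Q, D1 and D2 can be checked with a short sketch (function and variable names are illustrative). It gives about 0.98 for D2 and about 0.73 for D1; the slide's 0.74 presumably reflects rounding of intermediate values:

```python
import math

def cosine_sim(q, d):
    """Cosine of the angle between two weighted term vectors."""
    dot = sum(wq * wd for wq, wd in zip(q, d))
    norm = math.sqrt(sum(wq * wq for wq in q) * sum(wd * wd for wd in d))
    return dot / norm if norm else 0.0

Q = (0.4, 0.8)
D1 = (0.8, 0.3)
D2 = (0.2, 0.7)

sim_d1 = cosine_sim(Q, D1)
sim_d2 = cosine_sim(Q, D2)
print(round(sim_d1, 2), round(sim_d2, 2))
```

D2 ranks above D1 because its direction is closer to that of Q, even though D1 is the longer vector.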
Problems with Vector Space
• There is no real theoretical basis for the assumption of a term space
– it is more for visualization than having any real basis
– most similarity measures work about the same regardless of model
• Terms are not really orthogonal dimensions
– Terms are not independent of all other terms
Today
• Probabilistic Retrieval (Introduction)
• Processing Ranked Queries (the role of inverted files)
• Web Crawlers - Distributed indexing of the WWW
• Probabilistic Retrieval (Details)
Probabilistic Retrieval
• Goes back to the 1960s (Maron and Kuhns)
• Robertson’s “Probability Ranking Principle”
– Retrieved documents should be ranked in decreasing probability that they are relevant to the user’s query.
– How to estimate these probabilities?
• Several methods (Model 1, Model 2, Model 3) with different emphases on how estimates are done.
Probabilistic Models: Some Notation
• D = All present and future documents
• Q = All present and future queries
• (Di,Qj) = A document query pair
• x = class of similar documents, $x \subseteq D$
• y = class of similar queries, $y \subseteq Q$
• Relevance is a relation:
$R = \{(D_i, Q_j) \mid D_i \in D,\ Q_j \in Q,\ \text{document } D_i \text{ is judged relevant by the user submitting } Q_j\}$
Probabilistic Models
• Model 1 -- Probabilistic Indexing, P(R|y,Di)
• Model 2 -- Probabilistic Querying, P(R|Qj,x)
• Model 3 -- Merged Model, P(R| Qj, Di)
• Model 0 -- P(R|y,x)
• Probabilities are estimated based on prior usage or relevance estimation
Probabilistic Models
• Rigorous formal model attempts to predict the probability that a given document will be relevant to a given query
• Ranks retrieved documents according to this probability of relevance (Probability Ranking Principle)
• Relies on accurate estimates of probabilities for accurate results
Vector and Probabilistic Models
• Support “natural language” queries
• Treat documents and queries the same
• Support relevance feedback searching
• Support ranked retrieval
• Differ primarily in theoretical basis and in how the ranking is calculated
– Vector assumes relevance
– Probabilistic relies on relevance judgments or estimates
Web Search Engines
• Most include some version of Vector Space or extended Boolean
• Some offer both “ranked” and Boolean, but not together.
• Some engines (such as those based on the original WAIS) are little more than coordination-level matching for ranked retrieval.
Web Search Engines
• Some engines use added natural language processing techniques to identify concepts
– Lycos based on work by Michael Mauldin at CMU
– Excite’s “concept-based” search may be a development of Latent Semantic Indexing
• Some search engines use Probabilistic methods (with proprietary extensions)
– Inktomi/HotBot uses a form of SLR.
Web Search Engines
• Exact algorithms are not available for commercial WWW search engines
• Many search engines appear to be hybrids offering both ranked and Boolean elements
Web Search Conclusions
• Web Search engines are stretching the performance limits of ranked retrieval algorithms
• Most Web search engines today attempt to combine the best features of ranked and Boolean searching
• There is still a long way to go before All and Only the Relevant web pages are retrieved in response to your query
Web Crawlers
• How do the web search engines get all of the items they index?
• How do you store millions of words from hundreds of sites so that you can find them quickly (and efficiently)?
Depth-First Crawling
[Figure: link diagram of 13 pages spread over Sites 1, 2, 3, 5 and 6. The table lists the pages in depth-first crawl order.]
Site  Page
1     1
1     2
1     4
1     6
1     3
1     5
3     1
5     1
6     1
5     2
2     1
2     2
2     3
Breadth First
[Figure: the same link diagram of 13 pages over Sites 1, 2, 3, 5 and 6. The table lists the pages in breadth-first crawl order.]
Site  Page
1     1
2     1
1     2
1     6
1     3
2     2
2     3
1     4
3     1
1     5
5     1
5     2
6     1
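The two crawl orders differ only in how the frontier of pages waiting to be fetched is managed: a stack gives depth-first order, a queue gives breadth-first order. A minimal sketch over a small hypothetical link graph (the page names and links are invented for illustration and are not the graph from the slides; a real crawler would fetch and parse pages instead of reading a dictionary):

```python
from collections import deque

# Hypothetical link graph: page -> pages it links to.
LINKS = {
    "s1/p1": ["s1/p2", "s2/p1"],
    "s1/p2": ["s1/p3"],
    "s1/p3": [],
    "s2/p1": ["s2/p2"],
    "s2/p2": [],
}

def crawl(start, depth_first):
    """Return pages in visit order; pages are marked seen when queued."""
    frontier = deque([start])
    seen = {start}
    order = []
    while frontier:
        # Stack (LIFO) for depth-first, queue (FIFO) for breadth-first.
        page = frontier.pop() if depth_first else frontier.popleft()
        order.append(page)
        links = LINKS.get(page, [])
        # For the stack, push in reverse so the first link is followed first.
        for link in (reversed(links) if depth_first else links):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order

dfs_order = crawl("s1/p1", depth_first=True)
bfs_order = crawl("s1/p1", depth_first=False)
print(dfs_order)
print(bfs_order)
```

Depth-first follows one chain of links to its end before backing up, while breadth-first visits everything one link away from the start before going deeper.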
Inverted Files
• We have seen “Vector files” conceptually; an Inverted File is a vector file “inverted” so that rows become columns and columns become rows
docs  t1  t2  t3
D1    1   0   1
D2    1   0   0
D3    0   1   1
D4    1   0   0
D5    1   1   1
D6    1   1   0
D7    0   1   0
D8    0   1   0
D9    0   0   1
D10   0   1   1
Terms  D1  D2  D3  D4  D5  D6  D7  …
t1     1   1   0   1   1   1   0
t2     0   0   1   0   1   1   1
t3     1   0   1   0   1   0   0
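The inversion is a plain matrix transposition. A short sketch using the ten document vectors from the slide (variable names are illustrative); the last step keeps only the document IDs with a nonzero entry, which is the form an inverted file normally stores:

```python
# Document-by-term vectors over (t1, t2, t3), copied from the slide.
doc_vectors = {
    "D1": (1, 0, 1), "D2": (1, 0, 0), "D3": (0, 1, 1), "D4": (1, 0, 0),
    "D5": (1, 1, 1), "D6": (1, 1, 0), "D7": (0, 1, 0), "D8": (0, 1, 0),
    "D9": (0, 0, 1), "D10": (0, 1, 1),
}
terms = ("t1", "t2", "t3")

# Transpose: one row per term, one column per document.
term_rows = dict(zip(terms, zip(*doc_vectors.values())))

# Sparse form: for each term, the documents that contain it.
inverted = {
    term: [doc for doc, bit in zip(doc_vectors, row) if bit]
    for term, row in term_rows.items()
}
print(term_rows["t1"])
print(inverted)
```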
How Are Inverted Files Created
• Documents are parsed to extract words (or stems) and these are saved with the Document ID.
Doc 1: Now is the time for all good men to come to the aid of their country
Doc 2: It was a dark and stormy night in the country manor. The time was past midnight
Term      Doc #
now       1
is        1
the       1
time      1
for       1
all       1
good      1
men       1
to        1
come      1
to        1
the       1
aid       1
of        1
their     1
country   1
it        2
was       2
a         2
dark      2
and       2
stormy    2
night     2
in        2
the       2
country   2
manor     2
the       2
time      2
was       2
past      2
midnight  2
How Inverted Files are Created
• After all documents have been parsed the inverted file is sorted
Term      Doc #
a         2
aid       1
all       1
and       2
come      1
country   1
country   2
dark      2
for       1
good      1
in        2
is        1
it        2
manor     2
men       1
midnight  2
night     2
now       1
of        1
past      2
stormy    2
the       1
the       1
the       2
the       2
their     1
time      1
time      2
to        1
to        1
was       2
was       2
How Inverted Files are Created
• Multiple term entries for a single document are merged and frequency information added
Term      Doc #  Freq
a         2      1
aid       1      1
all       1      1
and       2      1
come      1      1
country   1      1
country   2      1
dark      2      1
for       1      1
good      1      1
in        2      1
is        1      1
it        2      1
manor     2      1
men       1      1
midnight  2      1
night     2      1
now       1      1
of        1      1
past      2      1
stormy    2      1
the       1      2
the       2      2
their     1      1
time      1      1
time      2      1
to        1      2
was       2      2
How Inverted Files are Created
• The file is split into a Dictionary and a Postings file
Dictionary:
Term      N docs  Tot Freq
a         1       1
aid       1       1
all       1       1
and       1       1
come      1       1
country   2       2
dark      1       1
for       1       1
good      1       1
in        1       1
is        1       1
it        1       1
manor     1       1
men       1       1
midnight  1       1
night     1       1
now       1       1
of        1       1
past      1       1
stormy    1       1
the       2       4
their     1       1
time      2       2
to        1       2
was       1       2
Postings (one row per term/document pair, in dictionary order):
Doc #  Freq
2      1
1      1
1      1
2      1
1      1
1      1
2      1
2      1
1      1
1      1
2      1
1      1
2      1
2      1
1      1
2      1
2      1
1      1
1      1
2      1
2      1
1      2
2      2
1      1
1      1
2      1
1      2
2      2
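The four steps (parse, sort, merge with frequencies, split into dictionary and postings) can be sketched end to end on the two example documents. This is a minimal in-memory illustration, not a production indexer: the tokenizer is a simple lowercase letter match with no stemming, and the offset stored in each dictionary entry is an assumed way of pointing from the dictionary into the postings list.

```python
import re
from collections import Counter

DOCS = {
    1: "Now is the time for all good men to come to the aid of their country",
    2: "It was a dark and stormy night in the country manor. "
       "The time was past midnight",
}

def tokenize(text):
    """Lowercase and split into alphabetic words (no stemming)."""
    return re.findall(r"[a-z]+", text.lower())

# Step 1: parse documents into (term, doc_id) pairs.
pairs = [(term, doc_id) for doc_id, text in DOCS.items()
         for term in tokenize(text)]

# Step 2: sort by term, then by document ID.
pairs.sort()

# Step 3: merge duplicate pairs and record within-document frequency.
merged = sorted(Counter(pairs).items())  # [((term, doc_id), freq), ...]

# Step 4: split into a dictionary and a postings list.
dictionary = {}  # term -> [n_docs, total_freq, offset of first posting]
postings = []    # (doc_id, freq), in dictionary order
for (term, doc_id), freq in merged:
    entry = dictionary.setdefault(term, [0, 0, len(postings)])
    entry[0] += 1
    entry[1] += freq
    postings.append((doc_id, freq))

n_docs, total_freq, offset = dictionary["the"]
print(n_docs, total_freq, postings[offset:offset + n_docs])
```

Looking up a term then costs one dictionary access plus a read of `n_docs` consecutive postings.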
Inverted files
• Permit fast search for individual terms
• The search result for each term is a list of document IDs (and optionally, frequency and/or positional information)
• These lists can be used to solve Boolean queries:
– country: d1, d2
– manor: d2
– country and manor: d2
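Solving a Boolean query over postings lists amounts to merging sorted lists of document IDs. A small sketch, assuming each list is kept sorted by document ID (the `postings` dictionary below is a hand-built excerpt of the two-document example, and the function names are illustrative):

```python
# Hand-built excerpt: term -> sorted list of document IDs.
postings = {
    "country": [1, 2],
    "manor": [2],
    "aid": [1],
}

def intersect(a, b):
    """AND: walk both sorted lists in step, keeping IDs found in both."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def union(a, b):
    """OR: every ID that appears in either list, kept sorted."""
    return sorted(set(a) | set(b))

country_and_manor = intersect(postings["country"], postings["manor"])
aid_or_manor = union(postings["aid"], postings["manor"])
print(country_and_manor, aid_or_manor)
```

The merge in `intersect` reads each list once, which is why keeping postings sorted pays off for long lists.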
Inverted Files
• Lots of alternative implementations
– E.g.: Cheshire builds within-document frequency using a hash table during parsing
– Document IDs and frequency info are stored in a B-tree index keyed by the term.
• See the chapter on inverted files in the reader for other implementations.
Probabilistic Models (Again)
• Rigorous formal model attempts to predict the probability that a given document will be relevant to a given query
• Ranks retrieved documents according to this probability of relevance (Probability Ranking Principle)
• Relies on accurate estimates of probabilities for accurate results
Probabilistic Models: Logistic Regression
• Estimates for relevance based on log-linear model with various statistical measures of document content as independent variables.
Log odds of relevance is a linear function of attributes:
$\log O(R \mid q_i, d_j, t_k) = c_0 + c_1 v_1 + c_2 v_2 + \ldots + c_n v_n$
Term contributions summed:
$\log O(R \mid q_i, d_j) = \sum_{k=1}^{m} [\log O(R \mid q_i, d_j, t_k) - \log O(R)]$
Probability of Relevance is inverse of log odds:
$P(R \mid q_i, d_j) = \frac{1}{1 + e^{-\log O(R \mid q_i, d_j)}}$
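The three steps (linear log odds per term, summing term contributions relative to the prior, converting log odds to a probability) can be sketched as follows. The coefficient, attribute and prior values in the example are hypothetical; in practice the coefficients come from a regression fit on judged documents.

```python
import math

def term_log_odds(coeffs, values):
    """log O(R | q, d, t_k) = c0 + c1*v1 + ... + cn*vn."""
    c0, rest = coeffs[0], coeffs[1:]
    return c0 + sum(c * v for c, v in zip(rest, values))

def combined_log_odds(per_term_log_odds, prior_log_odds):
    """Sum each term's contribution relative to the prior log odds."""
    return sum(lo - prior_log_odds for lo in per_term_log_odds)

def probability(log_odds):
    """Logistic transform: P = 1 / (1 + e^(-log odds))."""
    return 1.0 / (1.0 + math.exp(-log_odds))

# Hypothetical values: two matching terms, one attribute each.
coeffs = (-1.0, 0.5)
per_term = [term_log_odds(coeffs, (3.0,)), term_log_odds(coeffs, (4.0,))]
total = combined_log_odds(per_term, prior_log_odds=-1.0)
print(per_term, total, probability(total))
```

Log odds of zero correspond to a probability of 0.5, and the transform is monotonic, so ranking by log odds and ranking by probability give the same order.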
Probabilistic Models: Logistic Regression attributes
Average Absolute Query Frequency: $X_1 = \frac{1}{M} \sum_{j=1}^{M} \log QAF_{t_j}$
Query Length: $X_2 = \sqrt{QL}$
Average Absolute Document Frequency: $X_3 = \frac{1}{M} \sum_{j=1}^{M} \log DAF_{t_j}$
Document Length: $X_4 = \sqrt{DL}$
Average Inverse Document Frequency: $X_5 = \frac{1}{M} \sum_{j=1}^{M} \log IDF_{t_j}$
Inverse Document Frequency: $IDF_{t_j} = \frac{N - n_{t_j}}{n_{t_j}}$
Number of Terms in common between query and document -- logged: $X_6 = \log M$
Probabilistic Models: Logistic Regression
Probability of relevance is based on Logistic regression from a sample set of documents to determine values of the coefficients. At retrieval the probability estimate is obtained from the log odds:
$\log O(R \mid Q, D) = c_0 + \sum_{i=1}^{6} c_i X_i$
$P(R \mid Q, D) = \frac{1}{1 + e^{-\log O(R \mid Q, D)}}$
For the 6 X attribute measures shown previously
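A sketch of scoring one query-document pair with the six attributes. It assumes the following reading of the attributes: X1, X3 and X5 are averages over the M matching terms of log query frequency, log document frequency and log IDF, with IDF = (N - n) / n; X2 and X4 are square roots of query and document length (the square roots are an assumption); X6 = log M. The coefficients below are hypothetical placeholders, not fitted values.

```python
import math

def attributes(qaf, daf, n_t, N, ql, dl):
    """Compute [X1..X6] for one query-document pair.

    qaf, daf, n_t: per-matching-term query frequency, document frequency,
    and number of documents containing the term. N: collection size.
    ql, dl: query and document length. Requires at least one matching term.
    """
    M = len(qaf)
    x1 = sum(math.log(f) for f in qaf) / M
    x2 = math.sqrt(ql)
    x3 = sum(math.log(f) for f in daf) / M
    x4 = math.sqrt(dl)
    x5 = sum(math.log((N - n) / n) for n in n_t) / M
    x6 = math.log(M)
    return [x1, x2, x3, x4, x5, x6]

def relevance_probability(coeffs, xs):
    """Linear log odds c0 + sum(ci * Xi), then the logistic transform."""
    log_odds = coeffs[0] + sum(c * x for c, x in zip(coeffs[1:], xs))
    return 1.0 / (1.0 + math.exp(-log_odds))

# Hypothetical coefficients c0..c6 and a toy pair with two matching terms.
COEFFS = [-3.0, 1.0, -0.2, 0.5, -0.1, 0.3, 1.5]
xs = attributes(qaf=[1, 1], daf=[1, 1], n_t=[50, 50], N=100, ql=4, dl=9)
p = relevance_probability(COEFFS, xs)
print(xs, p)
```

Documents are then ranked by `p` in decreasing order, as the Probability Ranking Principle requires.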
Probabilistic Models
Advantages
• Strong theoretical basis
• In principle should supply the best predictions of relevance given available information
• Can be implemented similarly to Vector
Disadvantages
• Relevance information is required -- or is “guestimated”
• Important indicators of relevance may not be terms -- though terms only are usually used
• Optimally requires on-going collection of relevance information