Administrivia: lecture slides on Information Retrieval
Copyright © Kambhampati / Weld 2002

TRANSCRIPT

Page 1: Administrivia

• Project?

Page 2: Overview

• History

• Networking

• Crawlers

• Information Retrieval


Page 3: Outline of IR topics

• Background
  – Definitions, etc.

• The Problem
  – 100,000+ pages

• The Solution
  – Ranking docs
  – Vector space

• Extensions
  – Relevance feedback, clustering, query expansion, etc.

Page 4: Information Retrieval

• Traditional Model
  – Given:
    • a set of documents
    • a query expressed as a set of keywords
  – Return:
    • a ranked set of documents most relevant to the query
  – Evaluation:
    • Precision: fraction of returned documents that are relevant
    • Recall: fraction of relevant documents that are returned
    • Efficiency

• Web-induced headaches
  – Scale (billions of documents)
  – Hypertext (inter-document connections)

• Consequently
  – Ranking that takes link structure into account (Authority/Hub)
  – Indexing and retrieval algorithms that are ultra fast

Page 5: What is Information Retrieval?

• Given a large repository of documents, how do I get at the ones that I want?
  – Examples: Lexis/Nexis, medical reports, AltaVista
  – Keyword based [can't handle synonymy, polysemy]

• Different from databases
  – Unstructured (or semi-structured) data
  – Information is (typically) text
  – Requests are (typically) word-based

In principle, this requires NLP! But NLP is too hard as yet, so IR tries to get by with syntactic methods.

Page 6: What is IR, continued

• IR = representation, storage, organization of, and access to information items

• Focus is on the user's information need

• Example user information need:
  – Find all docs containing information on college tennis teams which (1) are maintained by a USA university and (2) participate in the NCAA tournament.

• Emphasis is on retrieval of information (not data)

Page 7: Information vs. Data

• Data retrieval
  – which docs contain a set of keywords?
  – well-defined semantics
  – a single erroneous object implies failure!

• Information retrieval
  – information about a subject or topic
  – semantics is frequently loose
  – small errors are tolerated

• IR system:
  – interprets contents of information items
  – generates a ranking which reflects relevance
  – the notion of relevance is most important

Page 8: IR - Past and Present

• IR at the center of the stage
  – IR in the last 20 years:
    • classification and categorization
    • systems and languages
    • user interfaces and visualization
  – The Web has renewed focus on IR
    • universal repository of knowledge
    • free (low cost) universal access
    • no central editorial board
    • many problems though: IR seen as key to finding the solutions!

Page 9: Classic IR Models - Basic Concepts

• Each document is represented by a set of representative keywords or index terms
  – A query is seen as a "mini" document

• An index term is a document word useful for remembering the document's main themes
  – Usually, index terms are nouns, because nouns have meaning by themselves
  – [However, search engines assume that all words are index terms (full-text representation)]

[Diagram: docs and the user's information need are both reduced to index terms; the doc and query representations are then matched to produce a ranking.]

Page 10: The Retrieval Process

[Diagram: the retrieval process. The user's need enters through a User Interface; Text Operations (stemming, noun phrase detection, etc.) produce a logical view of both the query and the documents; Query Operations (elaboration, relevance feedback) refine the query; Indexing builds an inverted file over the Text Database via a DB Manager Module; Searching (hash tables, etc.) fetches candidate docs; Ranking (vector models, etc.) orders them; ranked docs and user feedback flow back to the interface.]

Page 11: Generating Keywords

• Logical view of the documents: from document structure and full text down to index terms

[Diagram: docs pass through accents/spacing/stopword handling, noun-group detection, and stemming; manual indexing can also supply terms.]

Pipeline:
• Stop-word elimination
• Noun phrase detection
• Stemming
• Generating index terms
• Improving quality of terms: synonyms, co-occurrence detection, latent semantic indexing, ...

Page 12: Example of Stemming and Stopword Elimination

Original sentence:
  The number of Web pages on the World Wide Web was estimated to be over 800 million in 1999.

(The slide shows this sentence after stop-word elimination and stemming.)
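A minimal Python sketch of what these two steps do to the slide's sentence; the stop-word list and the suffix-stripping rules below are illustrative stand-ins, not the lecture's actual ones:

    STOPWORDS = {"the", "of", "on", "was", "to", "be", "over", "in"}

    def naive_stem(word):
        """Strip a few common suffixes (a crude stand-in for a Porter stemmer)."""
        for suffix in ("ation", "ated", "ing", "es", "ed", "s"):
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                return word[:-len(suffix)]
        return word

    def index_terms(text):
        tokens = [t.strip(".,").lower() for t in text.split()]
        return [naive_stem(t) for t in tokens if t not in STOPWORDS]

    sentence = ("The number of Web pages on the World Wide Web "
                "was estimated to be over 800 million in 1999.")
    print(index_terms(sentence))
    # ['number', 'web', 'pag', 'world', 'wide', 'web', 'estim', '800', 'million', '1999']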

Page 13: A quick glimpse at inverted files

Dictionary (term, # docs, total freq) and postings (doc #, freq):

Term      NDocs TotFreq  Postings (doc#, freq)
a         1     1        (2,1)
aid       1     1        (1,1)
all       1     1        (1,1)
and       1     1        (2,1)
come      1     1        (1,1)
country   2     2        (1,1) (2,1)
dark      1     1        (2,1)
for       1     1        (1,1)
good      1     1        (1,1)
in        1     1        (2,1)
is        1     1        (1,1)
it        1     1        (2,1)
manor     1     1        (2,1)
men       1     1        (1,1)
midnight  1     1        (2,1)
night     1     1        (2,1)
now       1     1        (1,1)
of        1     1        (1,1)
past      1     1        (2,1)
stormy    1     1        (2,1)
the       2     4        (1,2) (2,2)
their     1     1        (1,1)
time      2     2        (1,1) (2,1)
to        1     2        (1,2)
was       1     2        (2,2)
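A Python sketch of how this dictionary-and-postings structure is built. The two sample documents below are assumptions chosen to be consistent with the table, not necessarily the ones behind the slide:

    from collections import defaultdict

    docs = {
        1: "now is the time for all good men to come to the aid of their country",
        2: "it was a dark and stormy night in the country manor the time was past midnight",
    }

    postings = defaultdict(dict)          # term -> {doc_id: frequency}
    for doc_id, text in docs.items():
        for term in text.split():
            postings[term][doc_id] = postings[term].get(doc_id, 0) + 1

    # The dictionary entry for a term: number of docs and total frequency.
    for term in sorted(postings):
        entry = postings[term]
        print(term, len(entry), sum(entry.values()), dict(entry))
    # e.g.  country 2 2 {1: 1, 2: 1}
    #       the 2 4 {1: 2, 2: 2}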

Page 14: Ranking

• An ordering of the retrieved documents that (hopefully) reflects their relevance to the user query

• Based on fundamental premises regarding the notion of relevance, such as:
  – common sets of index terms
  – sharing of weighted terms
  – likelihood of relevance

• Each set of premises leads to a distinct IR model

Page 15: IR Models

User task: retrieval (ad hoc, filtering) or browsing.

• Retrieval
  – Classic Models: Boolean, vector, probabilistic
  – Set Theoretic: fuzzy, extended Boolean
  – Algebraic: generalized vector, latent semantic indexing, neural networks
  – Probabilistic: inference network, belief network
  – Structured Models: non-overlapping lists, proximal nodes

• Browsing
  – flat, structure guided, hypertext

Page 16: How to Measure Performance

• Set of documents

• Docs the user desires

• User query

Page 17: Measuring Performance

• Precision
  – Proportion of selected items that are correct: tp / (tp + fp)

• Recall
  – Proportion of target items that were selected: tp / (tp + fn)

• Precision-Recall curve
  – Shows the tradeoff

[Diagram: the docs the system returned vs. the actual relevant docs; their overlap is tp, returned-but-not-relevant docs are fp, relevant-but-not-returned docs are fn, and everything else is tn.]
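A small Python sketch of these two measures over sets of document ids; the returned and relevant sets here are made-up examples:

    def precision_recall(returned, relevant):
        tp = len(returned & relevant)                      # true positives
        precision = tp / len(returned) if returned else 0.0
        recall = tp / len(relevant) if relevant else 0.0
        return precision, recall

    returned = {1, 2, 3, 4, 5}      # docs the system returned
    relevant = {2, 4, 5, 9}         # docs the user actually wanted
    print(precision_recall(returned, relevant))   # (0.6, 0.75)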

Page 18: Precision/Recall Curves

11-point recall-precision curve: precision is measured (after interpolation) at the recall levels 0.0, 0.1, 0.2, ..., 1.0.

Example:
• Suppose that for a given query, 10 documents are relevant.
• Suppose that when all documents are ranked in descending similarity, we have
  d1 d2 d3 d4 d5 d6 d7 d8 d9 d10 d11 d12 d13 d14 d15 d16 d17 d18 d19 d20 d21 d22 d23 d24 d25 d26 d27 d28 d29 d30 d31 ...
  (the slide marks which of these ranked documents are the relevant ones)

[Figure: the resulting precision-recall curve, with recall on the x-axis from 0.1 to 1.0.]
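A Python sketch of the interpolated 11-point computation. Since the slide's marks on d1..d31 do not survive in this transcript, the ranks of the 10 relevant documents below are hypothetical:

    def eleven_point(relevant_ranks, num_relevant):
        """Interpolated precision at recall 0.0, 0.1, ..., 1.0."""
        # (recall, precision) observed after each relevant doc is retrieved
        points = []
        for i, rank in enumerate(sorted(relevant_ranks), start=1):
            points.append((i / num_relevant, i / rank))
        # interpolated precision at r = max precision at any recall >= r
        return [max((p for rec, p in points if rec >= r10 / 10), default=0.0)
                for r10 in range(11)]

    relevant_ranks = [1, 3, 6, 10, 12, 15, 20, 24, 28, 31]   # hypothetical
    print(eleven_point(relevant_ranks, 10))   # 11 precisions, recall 0.0..1.0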

Page 19: Precision/Recall Curves, continued

When evaluating the retrieval effectiveness of a text retrieval system or method, a large number of queries are used, and their average 11-point recall-precision curve is plotted.

[Figure: average recall-precision curves for three methods, labeled Method 1, Method 2, and Method 3.]

Page 20: Precision/Recall Curves, continued

• Methods 1 and 2 are better than Method 3.

• Method 1 is better than Method 2 at high recall.

Page 21: IR Models (recap)

(Repeats the IR-model taxonomy from Page 15.)

Page 22: Documents as Bags of Words

a: System and human system engineering testing of EPS
b: A survey of user opinion of computer system response time
c: The EPS user interface management system
d: Human machine interface for ABC computer applications
e: Relation of user perceived response time to error measurement
f: The generation of random, binary, ordered trees
g: The intersection graph of paths in trees
h: Graph minors IV: Widths of trees and well-quasi-ordering
i: Graph minors: A survey

Term-document matrix (rows: index terms; columns: docs a-i):

           a  b  c  d  e  f  g  h  i
Interface  0  0  1  0  0  0  0  0  0
User       0  1  1  0  1  0  0  0  0
System     2  1  1  0  0  0  0  0  0
Human      1  0  0  1  0  0  0  0  0
Computer   0  1  0  1  0  0  0  0  0
Response   0  1  0  0  1  0  0  0  0
Time       0  1  0  0  1  0  0  0  0
EPS        1  0  1  0  0  0  0  0  0
Survey     0  1  0  0  0  0  0  0  1
Trees      0  0  0  0  0  1  1  1  0
Graph      0  0  0  0  0  0  1  1  1
Minors     0  0  0  0  0  0  0  1  1

Page 23: Terminology: Term Weights

• Not all terms are equally useful for representing the document contents

• Less frequent terms allow identifying a narrower set of documents

• The importance of the index terms is represented by weights associated with them:
  – ki is an index term
  – dj is a document
  – t is the total number of index terms
  – K = (k1, k2, ..., kt) is the set of all index terms
  – wij >= 0 is a weight associated with (ki, dj)
    • wij = 0 indicates the term is missing from the doc
  – vec(dj) = (w1j, w2j, ..., wtj) is the weighted vector associated with the document dj (or query q)

Page 24: The Boolean Model

• Simple model based on set theory

• Queries specified as Boolean expressions
  – precise semantics

• Terms are either present or absent
  – thus wij ∈ {0, 1}

• Consider
  – q = ka ∧ (kb ∨ ¬kc)
  – dnf(q) = (1,1,1) ∨ (1,1,0) ∨ (1,0,0)
  – cc = (1,1,0) is a conjunctive component

Page 25: The Boolean Model Illustrated

q = ka ∧ (kb ∨ ¬kc)

sim(q,dj) = 1 if ∃ cc ∈ dnf(q) such that ∀ ki ∈ q, vec(dj)i = cci
          = 0 otherwise

[Venn diagram: circles for Ka, Kb, Kc; the conjunctive components (1,0,0), (1,1,0), and (1,1,1) are the shaded regions that satisfy q.]
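A Python sketch of this matching rule for q = ka ∧ (kb ∨ ¬kc); dnf(q) is enumerated by brute force, and the two documents are made up:

    from itertools import product

    def query(ka, kb, kc):
        return ka and (kb or not kc)

    # dnf(q): all 0/1 assignments over (ka, kb, kc) that satisfy the query
    dnf = [bits for bits in product((0, 1), repeat=3) if query(*bits)]
    print(dnf)   # [(1, 0, 0), (1, 1, 0), (1, 1, 1)]

    docs = {
        "d1": (1, 1, 0),   # has ka and kb, lacks kc -> matches cc (1,1,0)
        "d2": (0, 1, 1),   # lacks ka                -> no match
    }
    for name, vec in docs.items():
        print(name, "sim =", 1 if vec in dnf else 0)   # d1 sim = 1, d2 sim = 0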

Page 26: Evaluation of Boolean Model

Page 27: Drawbacks of the Boolean Model

• Binary decision criteria
  – no notion of partial matching
  – no ranking or grading scale

• Users must write Boolean expressions
  – awkward
  – often too simplistic

• Thus users get too few or too many documents

Page 28: Thus ... The Vector Model

• Use of binary weights is too limiting

• [0, 1] term weights are used to compute the degree of similarity between a query and documents

• Allows ranking of results

Page 29: The Vector Model - Definitions

• Documents and queries are modeled as bags of words
  – represented as vectors over the keyword space
  – vec(dj) = (w1j, w2j, ..., wtj)
  – vec(q) = (w1q, w2q, ..., wtq)
    • wij > 0 whenever ki ∈ dj
    • wiq >= 0 is associated with the pair (ki, q)
  – With each term ki is associated a unit vector vec(i)
    • the t unit vectors vec(i) are assumed orthonormal, i.e. they form an orthonormal basis for a t-dimensional space
    • What does this mean? Is this reasonable?

• Each vector holds a place for every term in the collection; therefore, most vectors are sparse
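In practice this sparsity suggests storing only the nonzero weights. A minimal Python sketch, with illustrative term names and weights:

    doc_vector = {"system": 2.0, "human": 1.0, "eps": 1.0}   # every other term: 0

    def weight(vec, term):
        return vec.get(term, 0.0)    # a missing term has weight 0

    print(weight(doc_vector, "system"))   # 2.0
    print(weight(doc_vector, "graph"))    # 0.0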

Page 30: Vector Space Example

(The documents a-i and the term-document matrix from Page 22, with each column now read as that document's vector over the 12-term space.)

Page 31: Similarity Function

The similarity (or closeness) of a document d = (w1, ..., wi, ..., wn) with respect to a query (or another document) q = (q1, ..., qi, ..., qn) is computed using a similarity (distance) function. Many similarity functions exist.

Page 32: Euclidean Distance

• Given two document vectors d1 and d2:

  Dist(d1, d2) = sqrt( Σi (wi1 - wi2)² )

  where wi1 and wi2 are the weights of term i in d1 and d2.

Page 33: Dot Product Distance

sim(q, d) = dot(q, d) = q1*w1 + ... + qn*wn

Example: suppose d = (0.2, 0, 0.3, 1) and q = (0.75, 0.75, 0, 1); then
sim(q, d) = 0.15 + 0 + 0 + 1 = 1.15

Observations about the dot product function:
• More terms in common => higher similarity
• For terms appearing in both q and d, those with higher weights contribute more to sim(q, d)
• Favors long documents over short documents
• The computed similarities have no clear upper bound
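A quick Python sketch checking the slide's dot-product example, with the Euclidean distance from Page 32 alongside for comparison:

    import math

    def dot(q, d):
        return sum(qi * di for qi, di in zip(q, d))

    def euclidean(d1, d2):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(d1, d2)))

    d = (0.2, 0, 0.3, 1)
    q = (0.75, 0.75, 0, 1)
    print(round(dot(q, d), 2))        # 1.15, as on the slide
    print(round(euclidean(q, d), 3))  # 0.977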

Page 34: A Normalized Similarity Metric

• Cosine similarity: for vectors A and B, cos(θ) = (A • B) / (|A| * |B|), so

  sim(q, dj) = cos(θj)
             = [vec(dj) • vec(q)] / (|dj| * |q|)
             = [Σi wij * wiq] / (|dj| * |q|)

• Since wij >= 0 and wiq >= 0, 0 <= sim(q, dj) <= 1

• A document is retrieved even if it matches the query terms only partially

[Diagram: document vectors a, b, c and query q in the (system, user, interface) space; θj is the angle between dj and q.]

           a  b  c
Interface  0  0  1
User       0  1  1
System     2  1  1
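A Python sketch of the cosine computation over the three column vectors above; the query vector is a hypothetical {user, system} query:

    import math

    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / norms if norms else 0.0

    # columns a, b, c over the terms (Interface, User, System)
    a, b, c = (0, 0, 2), (0, 1, 1), (1, 1, 1)
    q = (0, 1, 1)   # hypothetical query: {user, system}

    for name, d in (("a", a), ("b", b), ("c", c)):
        print(name, round(cosine(q, d), 3))   # a 0.707, b 1.0, c 0.816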

Page 35: Comparison of Euclidean and Cosine Distance Metrics

[Figure: the same documents plotted over the terms t1=database, t2=SQL, t3=index, t4=regression, t5=likelihood, t6=linear, once under Euclidean distance and once under cosine similarity.]

Page 36: Answering Queries

Terms: t1=database, t2=SQL, t3=index, t4=regression, t5=likelihood, t6=linear
Given Q = {database, index} = (1, 0, 1, 0, 0, 0):

• Represent the query as a vector
• Compute distances to all documents
• Rank according to distance
• Example: "database index"
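A Python sketch of this procedure for Q = {database, index}; the document vectors are hypothetical, since the slide's own document table is not reproduced in this transcript:

    import math

    TERMS = ("database", "SQL", "index", "regression", "likelihood", "linear")

    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / norms if norms else 0.0

    docs = {                         # hypothetical term-frequency vectors
        "d1": (2, 1, 3, 0, 0, 0),
        "d2": (0, 0, 0, 2, 3, 1),
        "d3": (1, 0, 0, 1, 0, 2),
    }
    q = tuple(1 if t in ("database", "index") else 0 for t in TERMS)

    for name in sorted(docs, key=lambda d: cosine(q, docs[d]), reverse=True):
        print(name, round(cosine(q, docs[name]), 3))
    # d1 0.945, d3 0.289, d2 0.0 -- d1 wins: it is about both query terms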

Page 37: Term Weights in the Vector Model

• sim(q, dj) = [Σi wij * wiq] / (|dj| * |q|)

• How do we compute the weights wij and wiq?

Page 38: Term Weights, continued

• Simple frequencies favor common words
  – e.g. query: "The Computer Tomography"

• A good weight must take into account two effects:
  – intra-document contents (similarity)
    • the tf factor: term frequency within a document
  – inter-document separation (dissimilarity)
    • the idf factor: inverse document frequency

• wij = tf(i,j) * idf(i)

Page 39: TF-IDF

• Let:
  – N be the total number of docs in the collection
  – ni be the number of docs which contain ki
  – freq(i,j) be the raw frequency of ki within dj

• A normalized tf factor is given by
  – f(i,j) = freq(i,j) / maxl freq(l,j)
    • where the maximum is computed over all terms l which occur within the document dj

• The idf factor is computed as
  – idf(i) = log(N / ni)
    • the log is used to make the values of tf and idf comparable
    • it can be interpreted as the amount of information associated with the term ki
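A Python sketch of these definitions over a toy corpus. The documents are made up; f(i,j) normalizes by the most frequent term in the same document, and idf(i) = log(N/ni) as above:

    import math

    docs = {
        "d1": "database index database SQL".split(),
        "d2": "regression likelihood linear".split(),
        "d3": "database linear".split(),
    }
    N = len(docs)

    def tf(term, doc):
        freqs = {t: doc.count(t) for t in doc}
        return freqs.get(term, 0) / max(freqs.values())

    def idf(term):
        n_i = sum(1 for doc in docs.values() if term in doc)
        return math.log(N / n_i) if n_i else 0.0

    def tf_idf(term, doc_name):
        return tf(term, docs[doc_name]) * idf(term)

    print(round(tf_idf("database", "d1"), 3))  # 0.405: max tf, but common in collection
    print(round(tf_idf("SQL", "d1"), 3))       # 0.549: lower tf, rarer term -> idf wins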

Page 40: Document/Query Representation using TF-IDF

• The best schemes use weights like
  – wij = f(i,j) * log(N/ni)
  – this strategy is called a tf-idf weighting scheme

• For the query term weights, several possibilities:
  – wiq = (0.5 + [0.5 * freq(i,q) / max(freq(i,q))]) * log(N/ni)
  – or use the idf weights (to give preference to rare words)
  – or let the user give weights to reflect her real preferences
    • easier said than done... users are often dunderheads...
    • help them with "relevance feedback" techniques

Page 41: Example with TF-IDF Weights

Given Q = {database, index} = (1, 0, 1, 0, 0, 0)
over the terms t1=database, t2=SQL, t3=index, t4=regression, t5=likelihood, t6=linear.

Note: in this case, the weights used in the query were 1 for t1 and t3, and 0 for the rest.

[Figure/table: the example documents ranked using tf-idf weights.]

Page 42: The Vector Model: Summary

• The vector model with tf-idf weights is ~ the best
  – usually as good as the known ranking alternatives
  – simple and fast to compute

• Advantages:
  – term weighting improves the quality of the answer set
  – partial matching
  – the cosine ranking formula sorts documents by query similarity

• Disadvantages:
  – assumes independence of index terms
  – does not handle synonymy/polysemy
  – query weighting may not reflect user relevance criteria