itcs 6265 ir & web mining itcs 6265/8265: advanced topics in kdd --- information retrieval and...

37
ITCS 6265 IR & Web Mining ITCS 6265/8265: Advanced Topics in KDD --- Information Retrieval and Web Mining Lecture 1 Boolean retrieval UNC Charlotte, Fall 2009

Upload: britton-maxwell

Post on 04-Jan-2016

220 views

Category:

Documents


5 download

TRANSCRIPT

Page 1: ITCS 6265 IR & Web Mining ITCS 6265/8265: Advanced Topics in KDD --- Information Retrieval and Web Mining Lecture 1 Boolean retrieval UNC Charlotte, Fall

ITCS 6265 IR & Web MiningITCS 6265 IR & Web Mining

ITCS 6265/8265: Advanced Topics in KDD ---

Information Retrieval and Web MiningLecture 1

Boolean retrieval

UNC Charlotte, Fall 2009

Page 2: ITCS 6265 IR & Web Mining ITCS 6265/8265: Advanced Topics in KDD --- Information Retrieval and Web Mining Lecture 1 Boolean retrieval UNC Charlotte, Fall

ITCS 6265 IR & Web MiningITCS 6265 IR & Web Mining

Course administrivia Web site: http://www.cs.uncc.edu/~wwu18/itcs6265/

Work/Grading: Assignments 20% Project 15% Paper presentation 5% Midterm 30% Final 30%

Required textbook: Introduction to Information Retrieval, by Manning et al.

2

    

Page 3: ITCS 6265 IR & Web Mining ITCS 6265/8265: Advanced Topics in KDD --- Information Retrieval and Web Mining Lecture 1 Boolean retrieval UNC Charlotte, Fall

ITCS 6265 IR & Web MiningITCS 6265 IR & Web Mining

Information Retrieval Information Retrieval (IR) is finding material (usually

documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).

Unstructured data: structure & semantics of data are implicit, not specified to ease computer processing

Structured data, e.g., records in relational databases

3

Page 4: ITCS 6265 IR & Web Mining ITCS 6265/8265: Advanced Topics in KDD --- Information Retrieval and Web Mining Lecture 1 Boolean retrieval UNC Charlotte, Fall

ITCS 6265 IR & Web MiningITCS 6265 IR & Web Mining

Unstructured data in 1680 Which plays of Shakespeare contain the words

Brutus AND Caesar but NOT Calpurnia? One could grep all of Shakespeare’s plays for Brutus

and Caesar, then strip out lines containing Calpurnia? Slow (for large corpora, e.g., billions or trillions of words) Other operations (e.g., find the word Romans near

countrymen) not feasible Ranked retrieval (best documents to return)

Later lectures

Need index!4

Sec. 1.1

Page 5: ITCS 6265 IR & Web Mining ITCS 6265/8265: Advanced Topics in KDD --- Information Retrieval and Web Mining Lecture 1 Boolean retrieval UNC Charlotte, Fall

ITCS 6265 IR & Web MiningITCS 6265 IR & Web Mining

Take one: Index in term-document incidence matrix

Antony and Cleopatra Julius Caesar The Tempest Hamlet Othello Macbeth

Antony 1 1 0 0 0 1

Brutus 1 1 0 1 0 0

Caesar 1 1 0 1 1 1

Calpurnia 0 1 0 0 0 0

Cleopatra 1 0 0 0 0 0

mercy 1 0 1 1 1 1

worser 1 0 1 1 1 0

1 if play contains word, 0 otherwise

Brutus AND Caesar but NOT Calpurnia

Sec. 1.1

Documents (plays)

Words(~32k)

Page 6: ITCS 6265 IR & Web Mining ITCS 6265/8265: Advanced Topics in KDD --- Information Retrieval and Web Mining Lecture 1 Boolean retrieval UNC Charlotte, Fall

ITCS 6265 IR & Web MiningITCS 6265 IR & Web Mining

Term vectors So we have a 0/1 vector for each term. To answer query: take the vectors for Brutus, Caesar

and Calpurnia (complemented) bitwise AND. 110100 AND 110111 AND 101111 = 100100.

6

Sec. 1.1

Antony and Cleopatra Julius Caesar The Tempest Hamlet Othello Macbeth

Antony 1 1 0 0 0 1

Brutus 1 1 0 1 0 0

Caesar 1 1 0 1 1 1

Calpurnia 0 1 0 0 0 0

Cleopatra 1 0 0 0 0 0

mercy 1 0 1 1 1 1

worser 1 0 1 1 1 0

Page 7: ITCS 6265 IR & Web Mining ITCS 6265/8265: Advanced Topics in KDD --- Information Retrieval and Web Mining Lecture 1 Boolean retrieval UNC Charlotte, Fall

ITCS 6265 IR & Web MiningITCS 6265 IR & Web Mining

Answers to query

Antony and Cleopatra (Act III, Scene ii)Agrippa [Aside to DOMITIUS ENOBARBUS]: Why, Enobarbus,

When Antony found Julius Caesar dead,

He cried almost to roaring; and he wept

When at Philippi he found Brutus slain.

Hamlet (Act III, Scene ii)Lord Polonius: I did enact Julius Caesar: I was killed i' the

Capitol; Brutus killed me.

7

Sec. 1.1

Page 8: ITCS 6265 IR & Web Mining ITCS 6265/8265: Advanced Topics in KDD --- Information Retrieval and Web Mining Lecture 1 Boolean retrieval UNC Charlotte, Fall

ITCS 6265 IR & Web MiningITCS 6265 IR & Web Mining

Basic assumptions of Information Retrieval Collection: Fixed set of documents Goal: Retrieve documents with information that is

relevant to user’s information need and helps him complete a task

Note: information need is topic of interest, different from query --- which are formulated by users based on their information need, & then submitted to retrieval systems to find relevant documents

But many things might go wrong in this process …

8

Sec. 1.1

Page 9: ITCS 6265 IR & Web Mining ITCS 6265/8265: Advanced Topics in KDD --- Information Retrieval and Web Mining Lecture 1 Boolean retrieval UNC Charlotte, Fall

ITCS 6265 IR & Web MiningITCS 6265 IR & Web Mining

The classic search model

Corpus

TASK

Info Need

Query

Verbal form

Results

SEARCHENGINE

QueryRefinement

Get rid of mice in a politically correct way

Info about removing micewithout killing them

How do I trap mice alive?

mouse trap

Mis-conception

Mis-translation

Mis-formulation

Page 10: ITCS 6265 IR & Web Mining ITCS 6265/8265: Advanced Topics in KDD --- Information Retrieval and Web Mining Lecture 1 Boolean retrieval UNC Charlotte, Fall

ITCS 6265 IR & Web MiningITCS 6265 IR & Web Mining

How good are the retrieved docs? Precision : Fraction of retrieved docs that are

relevant to user’s information need Recall : Fraction of relevant docs in collection that are

retrieved More precise definitions and measurements to

follow in later lectures

10

Sec. 1.1

Page 11: ITCS 6265 IR & Web Mining ITCS 6265/8265: Advanced Topics in KDD --- Information Retrieval and Web Mining Lecture 1 Boolean retrieval UNC Charlotte, Fall

ITCS 6265 IR & Web MiningITCS 6265 IR & Web Mining

Can matrix scale to larger collection? Shakespeare’ collective works: ~36 plays

Now consider N = 1 million documents, each with about 1000 words. Note that this is still several orders of magnitude smaller than the size of Web!

Avg 6 bytes/word including spaces/punctuation 6GB (= 1M * 1000 * 6) of data in the documents.

Say there are M = 500K distinct terms among these.

11

Page 12: ITCS 6265 IR & Web Mining ITCS 6265/8265: Advanced Topics in KDD --- Information Retrieval and Web Mining Lecture 1 Boolean retrieval UNC Charlotte, Fall

ITCS 6265 IR & Web MiningITCS 6265 IR & Web Mining

Can’t build the matrix 500K x 1M matrix has half-a-trillion 0’s and 1’s. But it has no more than one billion 1’s.

matrix is extremely sparse. i.e., has a lot of zeros

What would be a better representation? We only record the 1 positions.

12

Why?

Page 13: ITCS 6265 IR & Web Mining ITCS 6265/8265: Advanced Topics in KDD --- Information Retrieval and Web Mining Lecture 1 Boolean retrieval UNC Charlotte, Fall

ITCS 6265 IR & Web MiningITCS 6265 IR & Web Mining

Take two: Inverted index For each term T, we must store a list of all

documents that contain T. Do we use an array or a list for this?

13

Brutus

Calpurnia

Caesar

1 2 3 5 8 13 21 34

2 4 8 16 32 64128

13 16

What happens if the word Caesar is added to document 14?

Sec. 1.2

Sorted by docID (more later on why).

Page 14: ITCS 6265 IR & Web Mining ITCS 6265/8265: Advanced Topics in KDD --- Information Retrieval and Web Mining Lecture 1 Boolean retrieval UNC Charlotte, Fall

ITCS 6265 IR & Web MiningITCS 6265 IR & Web Mining

Inverted index Linked lists generally preferred to arrays

Dynamic space allocation Insertion of terms into documents easy

But have space overhead due to pointers

14

Brutus

Calpurnia

Caesar

2 4 8 16 32 64 128

2 3 5 8 13 21 34

13 16

1

Dictionary Postings lists

PostingPosting

Sec. 1.2

Page 15: ITCS 6265 IR & Web Mining ITCS 6265/8265: Advanced Topics in KDD --- Information Retrieval and Web Mining Lecture 1 Boolean retrieval UNC Charlotte, Fall

ITCS 6265 IR & Web MiningITCS 6265 IR & Web Mining

Tokenizer

Token stream. Friends Romans Countrymen

Inverted index construction

Linguistic modules

Normalized tokens(index terms).

friend roman countryman

Indexer

Inverted index.

friend

roman

countryman

2 4

2

13 16

1

More onthese later.

Documents tobe indexed.

Friends, Romans, countrymen.

Sec. 1.2

Page 16: ITCS 6265 IR & Web Mining ITCS 6265/8265: Advanced Topics in KDD --- Information Retrieval and Web Mining Lecture 1 Boolean retrieval UNC Charlotte, Fall

ITCS 6265 IR & Web MiningITCS 6265 IR & Web Mining

Indexer steps Sequence of (Normalized token, Document ID) pairs.

I did enact JuliusCaesar I was killed

i' the Capitol; Brutus killed me.

Doc 1

So let it be withCaesar. The noble

Brutus hath told youCaesar was ambitious

Doc 2

Term Doc #I 1did 1enact 1julius 1caesar 1I 1was 1killed 1i' 1the 1capitol 1brutus 1killed 1me 1so 2let 2it 2be 2with 2caesar 2the 2noble 2brutus 2hath 2told 2you 2

caesar 2was 2ambitious 2

Sec. 1.2

Page 17: ITCS 6265 IR & Web Mining ITCS 6265/8265: Advanced Topics in KDD --- Information Retrieval and Web Mining Lecture 1 Boolean retrieval UNC Charlotte, Fall

ITCS 6265 IR & Web MiningITCS 6265 IR & Web Mining

Sorting Sort by terms & then by doc ID’s

Term Doc #ambitious 2be 2brutus 1brutus 2capitol 1caesar 1caesar 2caesar 2did 1enact 1hath 1I 1I 1i' 1it 2julius 1killed 1killed 1let 2me 1noble 2so 2the 1the 2told 2you 2was 1was 2with 2

Term Doc #I 1did 1enact 1julius 1caesar 1I 1was 1killed 1i' 1the 1capitol 1brutus 1killed 1me 1so 2let 2it 2be 2with 2caesar 2the 2noble 2brutus 2hath 2told 2you 2caesar 2was 2ambitious 2

Core indexing step.

Sec. 1.2

Page 18: ITCS 6265 IR & Web Mining ITCS 6265/8265: Advanced Topics in KDD --- Information Retrieval and Web Mining Lecture 1 Boolean retrieval UNC Charlotte, Fall

ITCS 6265 IR & Web MiningITCS 6265 IR & Web Mining

Merging Multiple term entries in the

same document are merged.

Frequency information is added.

Term Doc # Term freqambitious 2 1be 2 1brutus 1 1brutus 2 1capitol 1 1caesar 1 1caesar 2 2did 1 1enact 1 1hath 2 1I 1 2i' 1 1it 2 1julius 1 1killed 1 2let 2 1me 1 1noble 2 1so 2 1the 1 1the 2 1told 2 1you 2 1was 1 1was 2 1with 2 1

Term Doc #ambitious 2be 2brutus 1brutus 2capitol 1caesar 1caesar 2caesar 2did 1enact 1hath 1I 1I 1i' 1it 2julius 1killed 1killed 1let 2me 1noble 2so 2the 1the 2told 2you 2was 1was 2with 2

Why frequency?Will discuss later.

Sec. 1.2

Page 19: ITCS 6265 IR & Web Mining ITCS 6265/8265: Advanced Topics in KDD --- Information Retrieval and Web Mining Lecture 1 Boolean retrieval UNC Charlotte, Fall

ITCS 6265 IR & Web MiningITCS 6265 IR & Web Mining

Grouping (by term) & splitting (into dictionary and postings lists)

Doc # Term freq2 12 11 12 11 11 12 21 11 12 11 21 12 11 11 22 11 12 12 11 12 12 12 11 12 12 1

Term Doc freq Coll freqambitious 1 1be 1 1brutus 2 2capitol 1 1caesar 2 3did 1 1enact 1 1hath 1 1I 1 2i' 1 1it 1 1julius 1 1killed 1 2let 1 1me 1 1noble 1 1so 1 1the 2 2told 1 1you 1 1was 2 2with 1 1

Term Doc # Term freqambitious 2 1be 2 1brutus 1 1brutus 2 1capitol 1 1caesar 1 1caesar 2 2did 1 1enact 1 1hath 2 1I 1 2i' 1 1it 2 1julius 1 1killed 1 2let 2 1me 1 1noble 2 1so 2 1the 1 1the 2 1told 2 1you 2 1was 1 1was 2 1with 2 1

Sec. 1.2

Page 20: ITCS 6265 IR & Web Mining ITCS 6265/8265: Advanced Topics in KDD --- Information Retrieval and Web Mining Lecture 1 Boolean retrieval UNC Charlotte, Fall

ITCS 6265 IR & Web MiningITCS 6265 IR & Web Mining

Where do we pay in storage?

20

Doc # Freq2 12 11 12 11 11 12 21 11 12 11 21 12 11 11 22 11 12 12 11 12 12 12 11 12 12 1

Term N docs Coll freqambitious 1 1be 1 1brutus 2 2capitol 1 1caesar 2 3did 1 1enact 1 1hath 1 1I 1 2i' 1 1it 1 1julius 1 1killed 1 2let 1 1me 1 1noble 1 1so 1 1the 2 2told 1 1you 1 1was 2 2with 1 1

Pointers

Terms

Will quantify the storage, later.

Sec. 1.2

Page 21: ITCS 6265 IR & Web Mining ITCS 6265/8265: Advanced Topics in KDD --- Information Retrieval and Web Mining Lecture 1 Boolean retrieval UNC Charlotte, Fall

ITCS 6265 IR & Web MiningITCS 6265 IR & Web Mining

The index we just built How do we process a query?

Later - what kinds of queries can we process?

21

Today’s focus

Sec. 1.3

Page 22: ITCS 6265 IR & Web Mining ITCS 6265/8265: Advanced Topics in KDD --- Information Retrieval and Web Mining Lecture 1 Boolean retrieval UNC Charlotte, Fall

ITCS 6265 IR & Web MiningITCS 6265 IR & Web Mining

Query processing: AND Consider processing the query:

Brutus AND Caesar Locate Brutus in the Dictionary;

Retrieve its postings. Locate Caesar in the Dictionary;

Retrieve its postings. Intersect (sometimes called “merge”) the two postings:

22

128

34

2 4 8 16 32 64

1 2 3 5 8 13

21

Brutus

Caesar

Sec. 1.3

Page 23: ITCS 6265 IR & Web Mining ITCS 6265/8265: Advanced Topics in KDD --- Information Retrieval and Web Mining Lecture 1 Boolean retrieval UNC Charlotte, Fall

ITCS 6265 IR & Web MiningITCS 6265 IR & Web Mining

The Intersect/Merge algorithm

Page 24: ITCS 6265 IR & Web Mining ITCS 6265/8265: Advanced Topics in KDD --- Information Retrieval and Web Mining Lecture 1 Boolean retrieval UNC Charlotte, Fall

ITCS 6265 IR & Web MiningITCS 6265 IR & Web Mining

The complexity of merge Walk through the two postings simultaneously, in

time linear in the total number of postings entries

24

34

1282 4 8 16 32 64

1 2 3 5 8 13 21

128

34

2 4 8 16 32 64

1 2 3 5 8 13 21

Brutus

Caesar2 8

If the list lengths are x and y, the merge takes O(x+y)operations.Crucial: postings sorted by docID.

Sec. 1.3

Page 25: ITCS 6265 IR & Web Mining ITCS 6265/8265: Advanced Topics in KDD --- Information Retrieval and Web Mining Lecture 1 Boolean retrieval UNC Charlotte, Fall

ITCS 6265 IR & Web MiningITCS 6265 IR & Web Mining

Boolean queries: Exact match

The Boolean Retrieval model is being able to ask a query that is a Boolean expression: Boolean Queries are queries using AND, OR and NOT to

join query terms Views each document as a set of words Is precise: document matches condition or not.

Primary commercial retrieval tool for 3 decades. Professional searchers (e.g., lawyers) still like Boolean

queries: You know exactly what you’re getting.

25

Sec. 1.3

Page 26: ITCS 6265 IR & Web Mining ITCS 6265/8265: Advanced Topics in KDD --- Information Retrieval and Web Mining Lecture 1 Boolean retrieval UNC Charlotte, Fall

ITCS 6265 IR & Web MiningITCS 6265 IR & Web Mining

Example: WestLaw http://www.westlaw.com/

http://library.uncc.edu/electronic/display.php?resource=434 (accessible to uncc)

Largest commercial (paying subscribers) legal search service (started 1975; ranking added 1992)

Tens of terabytes of data; 700,000 users Majority of users still use boolean queries Example:

Information need: What is the statute of limitations in cases involving the federal tort claims act?

Query: LIMIT! /3 STATUTE ACT /S FEDERAL /2 TORT /3 CLAIM

/3 = within 3 words, /S = in same sentence 26

Sec. 1.4

Page 27: ITCS 6265 IR & Web Mining ITCS 6265/8265: Advanced Topics in KDD --- Information Retrieval and Web Mining Lecture 1 Boolean retrieval UNC Charlotte, Fall

ITCS 6265 IR & Web MiningITCS 6265 IR & Web Mining

Example: WestLaw http://www.westlaw.com/

Another example: Information need: Requirements for disabled people

to be able to access a workplace Query: disabl! /p access! /s work-site work-place

(employment /3 place)

Note that SPACE is disjunction, not conjunction! Long, precise queries; proximity operators;

incrementally developed; not like web search Professional searchers often like Boolean search:

Precision, transparency and control But that doesn’t mean they actually work better….

Sec. 1.4

Page 28: ITCS 6265 IR & Web Mining ITCS 6265/8265: Advanced Topics in KDD --- Information Retrieval and Web Mining Lecture 1 Boolean retrieval UNC Charlotte, Fall

ITCS 6265 IR & Web MiningITCS 6265 IR & Web Mining

Boolean queries: More general merges Exercise: Adapt the merge for the queries:

Brutus AND NOT CaesarBrutus OR NOT Caesar

Can we still run through the merge in time O(x+y)?What can we achieve?

28

Sec. 1.3

Page 29: ITCS 6265 IR & Web Mining ITCS 6265/8265: Advanced Topics in KDD --- Information Retrieval and Web Mining Lecture 1 Boolean retrieval UNC Charlotte, Fall

ITCS 6265 IR & Web MiningITCS 6265 IR & Web Mining

MergingWhat about an arbitrary Boolean formula?(Brutus OR Caesar) AND NOT(Antony OR Cleopatra) Can we always merge in “linear” time?

Linear in what? Can we do better?

29

Sec. 1.3

Page 30: ITCS 6265 IR & Web Mining ITCS 6265/8265: Advanced Topics in KDD --- Information Retrieval and Web Mining Lecture 1 Boolean retrieval UNC Charlotte, Fall

ITCS 6265 IR & Web MiningITCS 6265 IR & Web Mining

Query optimization

What is the best order for query processing? Consider a query that is an AND of t terms. For each of the t terms, get its postings, then AND

them together.

Brutus

Calpurnia

Caesar

1 2 3 5 8 16 21 34

2 4 8 16 32 64128

13 16

Query: Brutus AND Calpurnia AND Caesar30

Sec. 1.3

Page 31: ITCS 6265 IR & Web Mining ITCS 6265/8265: Advanced Topics in KDD --- Information Retrieval and Web Mining Lecture 1 Boolean retrieval UNC Charlotte, Fall

ITCS 6265 IR & Web MiningITCS 6265 IR & Web Mining

Query optimization example Process in order of increasing freq:

start with smallest set, then keep cutting further.

31

Brutus

Calpurnia

Caesar

1 2 3 5 8 13 21 34

2 4 8 16 32 64128

13 16

This is why we keptfreq in dictionary

Execute the query as (Caesar AND Brutus) AND Calpurnia.

Sec. 1.3

Page 32: ITCS 6265 IR & Web Mining ITCS 6265/8265: Advanced Topics in KDD --- Information Retrieval and Web Mining Lecture 1 Boolean retrieval UNC Charlotte, Fall

ITCS 6265 IR & Web MiningITCS 6265 IR & Web Mining

More general optimization e.g., (madding OR crowd) AND (ignoble OR

strife) Get freq’s for all terms. Estimate the size of each OR by the sum of its

freq’s (conservative). Process in increasing order of OR sizes.

32

Sec. 1.3

Page 33: ITCS 6265 IR & Web Mining ITCS 6265/8265: Advanced Topics in KDD --- Information Retrieval and Web Mining Lecture 1 Boolean retrieval UNC Charlotte, Fall

ITCS 6265 IR & Web MiningITCS 6265 IR & Web Mining

Exercise

Recommend a query processing order for

Term Freq eyes 213312

kaleidoscope 87009

marmalade 107913

skies 271658

tangerine 46653

trees 316812

33

(tangerine OR trees) AND(marmalade OR skies) AND(kaleidoscope OR eyes)

Page 34: ITCS 6265 IR & Web Mining ITCS 6265/8265: Advanced Topics in KDD --- Information Retrieval and Web Mining Lecture 1 Boolean retrieval UNC Charlotte, Fall

ITCS 6265 IR & Web MiningITCS 6265 IR & Web Mining

What is ahead? Part I: IR & Search Engines

Positional index: phrase/proximity queries Tolerant retrieval: wildcard, spelling correction, etc. Efficient index construction Index compression Ranked retrieval: term weighting, vector space

model Evaluation of IR system Relevance feedback & query expansion Web search, crawling, and indexing

34

Page 35: ITCS 6265 IR & Web Mining ITCS 6265/8265: Advanced Topics in KDD --- Information Retrieval and Web Mining Lecture 1 Boolean retrieval UNC Charlotte, Fall

ITCS 6265 IR & Web MiningITCS 6265 IR & Web Mining

Part II: Text mining techniques Classification: given a set of topics, plus a new doc D,

decide which topic(s) D belongs to. Naïve Bayes classifier, vector space classifier, support vector machine

Clustering: given a set of docs, group them into clusters based on their contents. flat, hierarchical clustering, latent semantic indexing

35

Page 36: ITCS 6265 IR & Web Mining ITCS 6265/8265: Advanced Topics in KDD --- Information Retrieval and Web Mining Lecture 1 Boolean retrieval UNC Charlotte, Fall

ITCS 6265 IR & Web MiningITCS 6265 IR & Web Mining

Other text mining tasks (if time permits) Entity extraction Summarization Opinion retrieval Sentiment analysis …

36

Page 37: ITCS 6265 IR & Web Mining ITCS 6265/8265: Advanced Topics in KDD --- Information Retrieval and Web Mining Lecture 1 Boolean retrieval UNC Charlotte, Fall

ITCS 6265 IR & Web MiningITCS 6265 IR & Web Mining

Part III: Web data mining Content mining

Web data extraction, web data integration, etc.

Linkage analysis & structure mining Authority, hub, pagerank, community discovery, etc

Usage mining Web/application server log, user browsing history, etc.