Introducing Information Retrieval and Web Search borrowing from: Pandu Nayak


Post on 10-Aug-2020


Page 1: Introducing Information Retrieval and Web Search · • Antony and Cleopatra, Act III, Scene ii Agrippa [Aside to DOMITIUS ENOBARBUS]: Why, Enobarbus, When Antony found Julius Caesar

Introducing Information Retrieval and Web Search

borrowing from: Pandu Nayak

Information Retrieval

• Information Retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).
  – These days we frequently think first of web search, but there are many other cases:
    • E-mail search
    • Searching your laptop
    • Corporate knowledge bases
    • Legal information retrieval

Basic assumptions of Information Retrieval

• Collection: A set of documents
  – Assume it is a static collection for the moment
• Goal: Retrieve documents with information that is relevant to the user’s information need and helps the user complete a task

Sec. 1.1

The classic search model

[Figure: User task → Info need → Query → Search engine (searching the Collection) → Results, with a query refinement loop from Results back to Query.]

• User task: Get rid of mice in a politically correct way
• Info need: Info about removing mice without killing them (Misconception?)
• Query: how trap mice alive (Misformulation?)

How good are the retrieved docs?

▪ Precision: Fraction of retrieved docs that are relevant to the user’s information need
▪ Recall: Fraction of relevant docs in collection that are retrieved
▪ More precise definitions and measurements to follow later

Sec. 1.1
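The two definitions above can be sketched in a few lines of Python. The function name and the docID sets in the example are hypothetical, chosen only to make the fractions easy to check.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two collections of docIDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant docs that were actually retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical example: 4 docs retrieved, 3 of them relevant; 6 relevant docs exist.
p, r = precision_recall([1, 2, 3, 4], [2, 3, 4, 7, 8, 9])
print(p, r)  # 0.75 0.5
```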

Term-document incidence matrices

Unstructured data in 1620

• Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia?
• One could grep all of Shakespeare’s plays for Brutus and Caesar, then strip out lines containing Calpurnia?
• Why is that not the answer?
  – Slow (for large corpora)
  – NOT Calpurnia is non-trivial
  – Other operations (e.g., find the word Romans near countrymen) not feasible
  – Ranked retrieval (best documents to return)
    • Later lectures

Sec. 1.1

Term-document incidence matrices

Term        Antony and Cleopatra   Julius Caesar   The Tempest   Hamlet   Othello   Macbeth
Antony      1                      1               0             0        0         1
Brutus      1                      1               0             1        0         0
Caesar      1                      1               0             1        1         1
Calpurnia   0                      1               0             0        0         0
Cleopatra   1                      0               0             0        0         0
mercy       1                      0               1             1        1         1
worser      1                      0               1             1        1         0

1 if play contains word, 0 otherwise

Query: Brutus AND Caesar BUT NOT Calpurnia

Sec. 1.1

Incidence vectors

• So we have a 0/1 vector for each term.
• To answer query: take the vectors for Brutus, Caesar and Calpurnia (complemented) ➔ bitwise AND.
  – 110100 AND
  – 110111 AND
  – 101111 =
  – 100100

Sec. 1.1
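As a minimal sketch, the bitwise AND can be reproduced with Python integers. The bit strings are the Brutus, Caesar and Calpurnia rows of the incidence matrix; the variable names and the 6-bit mask are choices made for this illustration.

```python
plays = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]

# Incidence vectors from the matrix; the leftmost bit is the first play.
brutus = 0b110100
caesar = 0b110111
calpurnia = 0b010000

mask = (1 << len(plays)) - 1  # restrict the complement to 6 bits
result = brutus & caesar & (~calpurnia & mask)

# Decode the set bits back into play titles.
matches = [play for i, play in enumerate(plays)
           if result & (1 << (len(plays) - 1 - i))]
print(format(result, "06b"), matches)
# 100100 ['Antony and Cleopatra', 'Hamlet']
```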

Answers to query

• Antony and Cleopatra, Act III, Scene ii
  Agrippa [Aside to DOMITIUS ENOBARBUS]: Why, Enobarbus, When Antony found Julius Caesar dead, He cried almost to roaring; and he wept When at Philippi he found Brutus slain.

• Hamlet, Act III, Scene ii
  Lord Polonius: I did enact Julius Caesar I was killed i’ the Capitol; Brutus killed me.

Sec. 1.1

Bigger collections

• Consider N = 1 million documents, each with about 1000 words.
• Avg 6 bytes/word including spaces/punctuation
  – 6GB of data in the documents.
• Say there are M = 500K distinct terms among these.

Sec. 1.1

Can’t build the matrix

• 500K x 1M matrix has half-a-trillion 0’s and 1’s.
• But it has no more than one billion 1’s. (Why?)
  – matrix is extremely sparse.
• What’s a better representation?
  – We only record the 1 positions.

Sec. 1.1
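The sparsity claim follows from simple counting: each word occurrence can set at most one cell of the matrix to 1. A short sketch of the arithmetic, using the slide's round figures (the variable names are illustrative):

```python
N = 1_000_000          # documents
words_per_doc = 1_000  # about 1000 words each
M = 500_000            # distinct terms

cells = M * N                 # entries in the term-document matrix
max_ones = N * words_per_doc  # each token sets at most one cell to 1

print(cells)              # 500000000000, i.e. half a trillion
print(max_ones)           # 1000000000, i.e. one billion
print(max_ones / cells)   # 0.002, so at most 0.2% of the cells are 1
```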

The Inverted Index
The key data structure underlying modern IR

Inverted index

• For each term t, we must store a list of all documents that contain t.
  – Identify each doc by a docID, a document serial number
• Can we use fixed-size arrays for this?

Brutus      →  1  2  4  11  31  45  173  174
Caesar      →  1  2  4  5  6  16  57  132
Calpurnia   →  2  31  54  101

What happens if the word Caesar is added to document 14?

Inverted index

• We need variable-size postings lists
  – On disk, a continuous run of postings is normal and best
  – In memory, can use linked lists or variable length arrays
    • Some tradeoffs in size/ease of insertion

Dictionary     Postings
Brutus      →  1  2  4  11  31  45  173  174
Caesar      →  1  2  4  5  6  16  57  132
Calpurnia   →  2  31  54  101

Each docID entry in a list is a posting. Sorted by docID (more later on why).

Sec. 1.2

Inverted index construction

Documents to be indexed: Friends, Romans, countrymen.
  ↓ Tokenizer
Token stream: Friends Romans Countrymen
  ↓ Linguistic modules
Modified tokens: friend roman countryman
  ↓ Indexer
Inverted index:
  friend      →  2  4
  roman       →  1  2
  countryman  →  13  16

Sec. 1.2

Initial stages of text processing

• Tokenization
  – Cut character sequence into word tokens
    • Deal with “John’s”, a state-of-the-art solution
• Normalization
  – Map text and query term to same form
    • You want U.S.A. and USA to match
• Stemming
  – We may wish different forms of a root to match
    • authorize, authorization
• Stop words
  – We may omit very common words (or not)
    • the, a, to, of
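A rough sketch of three of these stages (tokenization, normalization, stop-word removal) is shown below. The regular expression and the rule of dropping periods are simplifying assumptions made for this example, not the lecture's method, and stemming is left out because a realistic stemmer is considerably more involved.

```python
import re

STOP_WORDS = {"the", "a", "to", "of"}  # the slide's example stop words

def tokenize(text):
    # Runs of word characters, optionally joined by . ' or -
    # (so "U.S.A" and "John's" each stay one token).
    return re.findall(r"\w+(?:[.'-]\w+)*", text)

def normalize(token):
    # Lowercase and drop periods so "U.S.A." and "USA" map to the same form.
    return token.lower().replace(".", "")

def preprocess(text, drop_stop_words=True):
    terms = [normalize(t) for t in tokenize(text)]
    return [t for t in terms if not (drop_stop_words and t in STOP_WORDS)]

print(preprocess("U.S.A. and USA"))                # ['usa', 'and', 'usa']
print(preprocess("Friends, Romans, countrymen."))  # ['friends', 'romans', 'countrymen']
```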

Indexer steps: Token sequence

• Sequence of (Modified token, Document ID) pairs.

Doc 1: I did enact Julius Caesar I was killed i’ the Capitol; Brutus killed me.

Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious

Sec. 1.2

Indexer steps: Sort

• Sort by terms
  – And then docID

Core indexing step

Sec. 1.2

Indexer steps: Dictionary & Postings

• Multiple term entries in a single document are merged.
• Split into Dictionary and Postings
• Doc. frequency information is added.

Why frequency? Will discuss later.

Sec. 1.2
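The three indexer steps (pair sequence, sort, merge into dictionary and postings) can be sketched on the two example documents. Whitespace splitting, punctuation stripping and lowercasing stand in for the tokenizer and linguistic modules here; that simplification and the function name `build_index` are assumptions of this example.

```python
from collections import defaultdict

docs = {
    1: "I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious",
}

def build_index(docs):
    # Step 1: sequence of (modified token, docID) pairs.
    pairs = [(tok.strip(".,;'").lower(), doc_id)
             for doc_id, text in docs.items()
             for tok in text.split()]
    # Step 2: sort by term, then by docID.
    pairs.sort()
    # Step 3: merge repeated entries within a document into one posting.
    index = defaultdict(list)
    for term, doc_id in pairs:
        if not index[term] or index[term][-1] != doc_id:
            index[term].append(doc_id)
    return dict(index)

index = build_index(docs)
# Document frequency is the length of each postings list.
doc_freq = {term: len(postings) for term, postings in index.items()}
print(index["caesar"], doc_freq["caesar"])  # [1, 2] 2
print(index["killed"], doc_freq["killed"])  # [1] 1
```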

Where do we pay in storage?

[Figure: dictionary and postings, annotated with the storage costs: terms and counts, pointers, lists of docIDs.]

IR system implementation
• How do we index efficiently?
• How much storage do we need?

Sec. 1.2

Query processing with an inverted index

The index we just built

• How do we process a query? (Our focus)
  – Later - what kinds of queries can we process?

Sec. 1.3

Query processing: AND

• Consider processing the query: Brutus AND Caesar
  – Locate Brutus in the Dictionary;
    • Retrieve its postings.
  – Locate Caesar in the Dictionary;
    • Retrieve its postings.
  – “Merge” the two postings (intersect the document sets):

Brutus  →  2  4  8  16  32  64  128
Caesar  →  1  2  3  5  8  13  21  34

Sec. 1.3

The merge

• Walk through the two postings simultaneously, in time linear in the total number of postings entries

Brutus  →  2  4  8  16  32  64  128
Caesar  →  1  2  3  5  8  13  21  34

If the list lengths are x and y, the merge takes O(x+y) operations. Crucial: postings sorted by docID.

Sec. 1.3

Intersecting two postings lists (a “merge” algorithm)
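A minimal Python sketch of such a merge, assuming each postings list is a plain Python list sorted by docID (the function name `intersect` is chosen for this example):

```python
def intersect(p1, p2):
    """Intersect two postings lists that are sorted by docID."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])  # docID occurs in both lists
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1                # advance the list with the smaller docID
        else:
            j += 1
    return answer

brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(intersect(brutus, caesar))  # [2, 8]
```

Each loop iteration advances at least one pointer, which is where the O(x+y) bound comes from.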

The Boolean Retrieval Model & Extended Boolean Models

Boolean queries: Exact match

• The Boolean retrieval model is being able to ask a query that is a Boolean expression:
  – Boolean Queries are queries using AND, OR and NOT to join query terms
    • Views each document as a set of words
    • Is precise: document matches condition or not.
  – Perhaps the simplest model to build an IR system on
• Primary commercial retrieval tool for 3 decades.
• Many search systems you still use are Boolean:
  – Email, library catalog, Mac OS X Spotlight

Sec. 1.3

Example: WestLaw http://www.westlaw.com/

• Largest commercial (paying subscribers) legal search service (started 1975; ranking added 1992; new federated search added 2010)
• Tens of terabytes of data; ~700,000 users
• Majority of users still use Boolean queries
• Example query:
  – What is the statute of limitations in cases involving the federal tort claims act?
  – LIMIT! /3 STATUTE ACTION /S FEDERAL /2 TORT /3 CLAIM
    • /3 = within 3 words, /S = in same sentence

Sec. 1.4

Example: WestLaw http://www.westlaw.com/

• Another example query:
  – Requirements for disabled people to be able to access a workplace
  – disabl! /p access! /s work-site work-place (employment /3 place)
• Note that SPACE is disjunction, not conjunction!
• Long, precise queries; proximity operators; incrementally developed; not like web search
• Many professional searchers still like Boolean search
  – You know exactly what you are getting
• But that doesn’t mean it actually works better….

Sec. 1.4

Boolean queries: More general merges

• Exercise: Adapt the merge for the queries:
    Brutus AND NOT Caesar
    Brutus OR NOT Caesar
• Can we still run through the merge in time O(x+y)? What can we achieve?

Sec. 1.3
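One possible answer to the first part of the exercise is sketched below, under the assumption that postings are Python lists sorted by docID; the function name `and_not` is hypothetical. The second query is harder: its result includes every document that does not contain Caesar, so its size depends on the whole collection rather than only on x and y.

```python
def and_not(p1, p2):
    """Return docIDs in p1 that are not in p2; both lists sorted by docID."""
    answer = []
    i = j = 0
    while i < len(p1):
        if j == len(p2) or p1[i] < p2[j]:
            answer.append(p1[i])  # p2 cannot contain this docID
            i += 1
        elif p1[i] == p2[j]:
            i += 1                # present in both lists, so excluded
            j += 1
        else:
            j += 1                # let p2 catch up
    return answer

brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(and_not(brutus, caesar))  # [4, 16, 32, 64, 128]
```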

Query optimization

• What is the best order for query processing?
• Consider a query that is an AND of n terms.
• For each of the n terms, get its postings, then AND them together.

Brutus     →  2  4  8  16  32  64  128
Caesar     →  1  2  3  5  8  16  21  34
Calpurnia  →  13  16

Query: Brutus AND Calpurnia AND Caesar

Sec. 1.3

Query optimization example

• Process in order of increasing freq:
  – start with smallest set, then keep cutting further.

This is why we kept document freq. in dictionary

Brutus     →  2  4  8  16  32  64  128
Caesar     →  1  2  3  5  8  16  21  34
Calpurnia  →  13  16

Execute the query as (Calpurnia AND Brutus) AND Caesar.

Sec. 1.3
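A small sketch of this ordering heuristic, using the three postings lists from the slide. Here the document frequency is taken as the length of each postings list; the dictionary layout and the function names are assumptions made for the example.

```python
from functools import reduce

def intersect(p1, p2):
    """Linear merge of two postings lists sorted by docID."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

index = {
    "brutus": [2, 4, 8, 16, 32, 64, 128],
    "caesar": [1, 2, 3, 5, 8, 16, 21, 34],
    "calpurnia": [13, 16],
}

def and_query(terms, index):
    # Process terms in order of increasing document frequency.
    ordered = sorted(terms, key=lambda t: len(index[t]))
    result = reduce(intersect, (index[t] for t in ordered))
    return ordered, result

ordered, result = and_query(["brutus", "calpurnia", "caesar"], index)
print(ordered, result)  # ['calpurnia', 'brutus', 'caesar'] [16]
```

Starting with the shortest list keeps every intermediate result no larger than that list.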

More general optimization

• e.g., (madding OR crowd) AND (ignoble OR strife)
• Get doc. freq.’s for all terms.
• Estimate the size of each OR by the sum of its doc. freq.’s (conservative).
• Process in increasing order of OR sizes.

Sec. 1.3

Exercise

• Recommend a query processing order for
    (tangerine OR trees) AND (marmalade OR skies) AND (kaleidoscope OR eyes)
• Which two terms should we process first?

Term           Freq
eyes           213312
kaleidoscope   87009
marmalade      107913
skies          271658
tangerine      46653
trees          316812
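One way to check an answer is to apply the conservative estimate from the previous slide, summing the document frequencies inside each OR and sorting the groups by that sum. The frequencies are those in the table; the data layout and names are choices made for this sketch.

```python
freq = {"eyes": 213312, "kaleidoscope": 87009, "marmalade": 107913,
        "skies": 271658, "tangerine": 46653, "trees": 316812}

query = [("tangerine", "trees"), ("marmalade", "skies"), ("kaleidoscope", "eyes")]

def estimate(or_group):
    # Upper bound on the size of an OR: the sum of its document frequencies.
    return sum(freq[t] for t in or_group)

# Process the OR groups in increasing order of estimated size.
order = sorted(query, key=estimate)
for group in order:
    print(" OR ".join(group), estimate(group))
```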

Phrase queries and positional indexes

Phrase queries

• We want to be able to answer queries such as “stanford university” – as a phrase
• Thus the sentence “I went to university at Stanford” is not a match.
  – The concept of phrase queries has proven easily understood by users; one of the few “advanced search” ideas that works
  – Many more queries are implicit phrase queries
• For this, it no longer suffices to store only <term : docs> entries

Sec. 2.4

A first attempt: Biword indexes

• Index every consecutive pair of terms in the text as a phrase
• For example the text “Friends, Romans, Countrymen” would generate the biwords
  – friends romans
  – romans countrymen
• Each of these biwords is now a dictionary term
• Two-word phrase query-processing is now immediate.

Sec. 2.4.1

Longer phrase queries

• Longer phrases can be processed by breaking them down
• stanford university palo alto can be broken into the Boolean query on biwords:

    stanford university AND university palo AND palo alto

Without the docs, we cannot verify that the docs matching the above Boolean query do contain the phrase.

Can have false positives!

Sec. 2.4.1
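The biword decomposition and its false-positive problem can be illustrated with a toy index. The two documents below are invented for this example; document 2 contains all three biwords but not the phrase itself.

```python
def biwords(tokens):
    """Every consecutive pair of terms, joined into one dictionary term."""
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]

def biword_index(docs):
    index = {}
    for doc_id, text in docs.items():
        for bw in biwords(text.lower().split()):
            index.setdefault(bw, set()).add(doc_id)
    return index

def phrase_query(phrase, index):
    # AND together the postings of every biword in the phrase.
    sets = [index.get(bw, set()) for bw in biwords(phrase.lower().split())]
    return set.intersection(*sets) if sets else set()

# Hypothetical collection: doc 2 matches every biword but not the phrase.
docs = {
    1: "stanford university palo alto",
    2: "palo alto is near stanford university and university palo is a typo",
}
index = biword_index(docs)
print(phrase_query("stanford university palo alto", index))  # {1, 2}
```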

Issues for biword indexes

• False positives, as noted before
• Index blowup due to bigger dictionary
  – Infeasible for more than biwords, big even for them
• Biword indexes are not the standard solution (for all biwords) but can be part of a compound strategy

Sec. 2.4.1

Solution 2: Positional indexes

• In the postings, store, for each term the position(s) in which tokens of it appear:

<term, number of docs containing term;
 doc1: position1, position2 … ;
 doc2: position1, position2 … ;
 etc.>

Sec. 2.4.2

Positional index example

<be: 993427; 1: 7, 18, 33, 72, 86, 231; 2: 3, 149; 4: 17, 191, 291, 430, 434; 5: 363, 367, …>

Which of docs 1, 2, 4, 5 could contain “to be or not to be”?

• For phrase queries, we use a merge algorithm recursively at the document level
• But we now need to deal with more than just equality

Sec. 2.4.2

Processing a phrase query

• Extract inverted index entries for each distinct term: to, be, or, not.
• Merge their doc:position lists to enumerate all positions with “to be or not to be”.
  – to:
    • 2: 1,17,74,222,551; 4: 8,16,190,429,433; 7: 13,23,191; ...
  – be:
    • 1: 17,19; 4: 17,191,291,430,434; 5: 14,19,101; ...
• Same general method for proximity searches

Sec. 2.4.2

Positional index size

• A positional index expands postings storage substantially
  – Even though indices can be compressed
• Nevertheless, a positional index is now standardly used because of the power and usefulness of phrase and proximity queries … whether used explicitly or implicitly in a ranking retrieval system.

Sec. 2.4.2

Positional index size

• Need an entry for each occurrence, not just once per document
• Index size depends on average document size (Why?)
  – Average web page has <1000 terms
  – SEC filings, books, even some epic poems … easily 100,000 terms
• Consider a term with frequency 0.1%

Document size   Postings   Positional postings
1000            1          1
100,000         1          100

A term with frequency 0.1% occurs about once in a 1000-term document but about 100 times in a 100,000-term document. A non-positional index stores one posting per document either way, while a positional index stores every occurrence.

Sec. 2.4.2

Rules of thumb

• A positional index is 2–4 times as large as a non-positional index
• Positional index size 35–50% of volume of original text
  – Caveat: all of this holds for “English-like” languages

Sec. 2.4.2

Combination schemes

• These two approaches can be profitably combined
  – For particular phrases (“Michael Jackson”, “Britney Spears”) it is inefficient to keep on merging positional postings lists
    • Even more so for phrases like “The Who”
• Williams et al. (2004) evaluate a more sophisticated mixed indexing scheme
  – A typical web query mixture was executed in ¼ of the time of using just a positional index
  – It required 26% more space than having a positional index alone

Sec. 2.4.3

Structured vs. Unstructured Data

IR vs. databases: Structured vs unstructured data

• Structured data tends to refer to information in “tables”

Employee   Manager   Salary
Smith      Jones     50000
Chang      Smith     60000
Ivy        Smith     50000

Typically allows numerical range and exact match (for text) queries, e.g., Salary < 60000 AND Manager = Smith.
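For contrast with text retrieval, the example query can be evaluated directly against the table. The sketch below represents the rows as Python dictionaries, which is only an illustration and not how a database would store them.

```python
employees = [
    {"employee": "Smith", "manager": "Jones", "salary": 50000},
    {"employee": "Chang", "manager": "Smith", "salary": 60000},
    {"employee": "Ivy", "manager": "Smith", "salary": 50000},
]

# Salary < 60000 AND Manager = Smith
matches = [row["employee"] for row in employees
           if row["salary"] < 60000 and row["manager"] == "Smith"]
print(matches)  # ['Ivy']
```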

Unstructured data

• Typically refers to free text
• Allows
  – Keyword queries including operators
  – More sophisticated “concept” queries e.g.,
    • find all web pages dealing with drug abuse
• Classic model for searching text documents

Semi-structured data

• In fact almost no data is “unstructured”
• E.g., this slide has distinctly identified zones such as the Title and Bullets
• … to say nothing of linguistic structure
• Facilitates “semi-structured” search such as
  – Title contains data AND Bullets contain search
• Or even
  – Title is about Object Oriented Programming AND Author something like stro*rup
  – where * is the wild-card operator