TRANSCRIPT
Introducing Information Retrieval and Web Search
borrowing from: Pandu Nayak
Information Retrieval
• Information Retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).
– These days we frequently think first of web search, but there are many other cases:
• E-mail search • Searching your laptop • Corporate knowledge bases • Legal information retrieval
Basic assumptions of Information Retrieval
• Collection: A set of documents
– Assume it is a static collection for the moment
• Goal: Retrieve documents with information that is relevant to the user's information need and helps the user complete a task
Sec. 1.1
The classic search model (figure): a User task ("Get rid of mice in a politically correct way") gives rise to an Info need ("Info about removing mice without killing them"), which is formulated as a Query ("how trap mice alive") sent to a Search engine over a Collection, producing Results; a Query refinement loop runs from the Results back to the Query. Things can go wrong at each step: a Misconception? between task and info need, a Misformulation? between info need and query.
How good are the retrieved docs?
▪ Precision: Fraction of retrieved docs that are relevant to the user's information need
▪ Recall: Fraction of relevant docs in collection that are retrieved
▪ More precise definitions and measurements to follow later
Sec. 1.1
Term-document incidence matrices

Unstructured data in 1620
• Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia?
• One could grep all of Shakespeare's plays for Brutus and Caesar, then strip out lines containing Calpurnia?
• Why is that not the answer?
– Slow (for large corpora)
– NOT Calpurnia is non-trivial
– Other operations (e.g., find the word Romans near countrymen) not feasible
– Ranked retrieval (best documents to return) • Later lectures
Sec. 1.1
Term-document incidence matrices
             Antony and  Julius  The      Hamlet  Othello  Macbeth
             Cleopatra   Caesar  Tempest
Antony           1          1       0        0       0        1
Brutus           1          1       0        1       0        0
Caesar           1          1       0        1       1        1
Calpurnia        0          1       0        0       0        0
Cleopatra        1          0       0        0       0        0
mercy            1          0       1        1       1        1
worser           1          0       1        1       1        0

1 if play contains word, 0 otherwise

Query: Brutus AND Caesar BUT NOT Calpurnia
Sec. 1.1
Incidence vectors
• So we have a 0/1 vector for each term.
• To answer the query: take the vectors for Brutus, Caesar and Calpurnia (complemented) → bitwise AND:
  110100 AND 110111 AND 101111 = 100100
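A minimal sketch of this in Python, encoding each term's row of the incidence matrix above as a bit vector (the plays and 0/1 values come straight from the matrix):

```python
# Sketch: Boolean retrieval over term incidence vectors (one bit per play).
# Rows mirror the matrix above; bit order follows the column order.
plays = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]

incidence = {
    "Brutus":    0b110100,
    "Caesar":    0b110111,
    "Calpurnia": 0b010000,
}

MASK = (1 << len(plays)) - 1  # six plays -> six bits

# Brutus AND Caesar AND NOT Calpurnia: complement Calpurnia, bitwise AND.
answer = (incidence["Brutus"]
          & incidence["Caesar"]
          & (~incidence["Calpurnia"] & MASK))      # -> 0b100100

for i, play in enumerate(plays):
    if answer & (1 << (len(plays) - 1 - i)):       # leftmost bit = first play
        print(play)   # Antony and Cleopatra, Hamlet
```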
Sec. 1.1
Answers to query
• Antony and Cleopatra, Act III, Scene ii
Agrippa [Aside to Domitius Enobarbus]: Why, Enobarbus,
When Antony found Julius Caesar dead,
He cried almost to roaring; and he wept
When at Philippi he found Brutus slain.
• Hamlet, Act III, Scene ii
Lord Polonius: I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me.
Sec. 1.1
Bigger collections
• Consider N = 1 million documents, each with about 1000 words.
• Avg 6 bytes/word including spaces/punctuation
– 6GB of data in the documents.
• Say there are M = 500K distinct terms among these.
Sec. 1.1
Can't build the matrix
• A 500K x 1M matrix has half-a-trillion 0's and 1's.
• But it has no more than one billion 1's. Why? The collection has only 1M × 1000 = one billion word occurrences, so at most a billion (term, document) pairs can be nonzero.
– The matrix is extremely sparse.
• What's a better representation?
– We only record the 1 positions.
Sec. 1.1
The Inverted Index: The key data structure underlying modern IR
Inverted index
• For each term t, we must store a list of all documents that contain t.
– Identify each doc by a docID, a document serial number
• Can we use fixed-size arrays for this?
What happens if the word Caesar is added to document 14?
Sec. 1.2
Brutus    → 1  2  4  11  31  45  173  174
Caesar    → 1  2  4  5  6  16  57  132
Calpurnia → 2  31  54  101
Inverted index
• We need variable-size postings lists
– On disk, a continuous run of postings is normal and best
– In memory, can use linked lists or variable length arrays
• Some tradeoffs in size/ease of insertion
Dictionary → Postings: each docID in a term's list is a posting, and postings are sorted by docID (more later on why).
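A minimal in-memory sketch, using plain Python lists as the variable-length postings arrays; add_posting is a toy helper (not a library call), and it also answers the earlier question about adding Caesar to document 14:

```python
import bisect

# Sketch: dictionary as a Python dict, postings as variable-length
# sorted lists of docIDs (the in-memory "variable length arrays" case).
index = {
    "Brutus":    [1, 2, 4, 11, 31, 45, 173, 174],
    "Caesar":    [1, 2, 4, 5, 6, 16, 57, 132],
    "Calpurnia": [2, 31, 54, 101],
}

def add_posting(index, term, doc_id):
    """Insert doc_id into term's postings list, keeping it sorted."""
    postings = index.setdefault(term, [])
    i = bisect.bisect_left(postings, doc_id)
    if i == len(postings) or postings[i] != doc_id:   # skip duplicates
        postings.insert(i, doc_id)

# Adding Caesar to document 14 is just an ordered insertion
# into a list that is free to grow.
add_posting(index, "Caesar", 14)
print(index["Caesar"])   # [1, 2, 4, 5, 6, 14, 16, 57, 132]
```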
Sec. 1.2
Inverted index construction (pipeline):

Documents to be indexed ("Friends, Romans, countrymen.")
→ Tokenizer → Token stream: Friends  Romans  Countrymen
→ Linguistic modules → Modified tokens: friend  roman  countryman
→ Indexer → Inverted index:
   friend     → 2  4
   roman      → 1  2
   countryman → 13  16
Sec. 1.2
Initial stages of text processing
• Tokenization
– Cut character sequence into word tokens
• Deal with "John's", a state-of-the-art solution
• Normalization
– Map text and query term to same form
• You want U.S.A. and USA to match
• Stemming
– We may wish different forms of a root to match
• authorize, authorization
• Stop words
– We may omit very common words (or not)
• the, a, to, of
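A toy illustration of these stages; the regex tokenizer, normalize(), stem() and stop list below are crude stand-ins for real linguistic modules (e.g., a Porter stemmer), not what production systems use:

```python
import re

STOPWORDS = {"the", "a", "to", "of"}         # the slide's examples

def tokenize(text):
    """Cut the character sequence into word tokens (very crude)."""
    return re.findall(r"[A-Za-z0-9.']+", text)

def normalize(token):
    """Map tokens to one form, e.g., U.S.A. -> usa, John's -> johns."""
    return token.lower().replace(".", "").replace("'", "")

def stem(token):
    """Toy suffix-stripper standing in for a real stemmer."""
    for suffix in ("ization", "ize", "s"):
        if token.endswith(suffix):
            return token[:-len(suffix)]
    return token

text = "John's U.S.A. authorization of the state-of-the-art solution"
terms = [stem(normalize(t)) for t in tokenize(text)]
print([t for t in terms if t not in STOPWORDS])
# ['john', 'usa', 'author', 'state', 'art', 'solution']
```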
Indexer steps: Token sequence
• Sequence of (Modified token, Document ID) pairs.

Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious
Sec. 1.2
Indexer steps: Sort
• Sort by terms
– And then docID
(the core indexing step)
Sec. 1.2
Indexer steps: Dictionary & Postings
• Multiple term entries in a single document are merged.
• Split into Dictionary and Postings
• Doc. frequency information is added.
Why frequency? Will discuss later.
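The three indexer steps in a minimal Python sketch, assuming tokens have already been through the linguistic modules:

```python
from collections import defaultdict

def build_index(docs):
    """docs: {docID: list of modified tokens}.
    Returns {term: (doc_frequency, postings list sorted by docID)}."""
    # Step 1: sequence of (modified token, docID) pairs.
    pairs = [(t, d) for d, toks in docs.items() for t in toks]
    # Step 2: sort by term, then by docID -- the core indexing step.
    pairs.sort()
    # Step 3: merge multiple entries per document, split into
    # dictionary and postings, and record document frequency.
    postings = defaultdict(list)
    for term, doc_id in pairs:
        if not postings[term] or postings[term][-1] != doc_id:
            postings[term].append(doc_id)
    return {t: (len(p), p) for t, p in postings.items()}

docs = {1: "i did enact julius caesar i was killed".split(),
        2: "so let it be with caesar the noble brutus".split()}
print(build_index(docs)["caesar"])   # (2, [1, 2])
```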
Sec. 1.2
Where do we pay in storage?
• Terms and counts (the dictionary)
• Pointers (from dictionary entries to postings)
• Lists of docIDs (the postings)
IR system implementation:
• How do we index efficiently?
• How much storage do we need?
Sec. 1.2
Query processing with an inverted index
• How do we process a query? (using the index we just built; our focus here)
– Later: what kinds of queries can we process?
Sec. 1.3
Query processing: AND
• Consider processing the query: Brutus AND Caesar
– Locate Brutus in the Dictionary; retrieve its postings.
– Locate Caesar in the Dictionary; retrieve its postings.
– "Merge" the two postings (intersect the document sets):

Brutus → 2  4  8  16  32  64  128
Caesar → 1  2  3  5  8  13  21  34
Sec. 1.3
The merge
• Walk through the two postings simultaneously, in time linear in the total number of postings entries

Brutus → 2  4  8  16  32  64  128
Caesar → 1  2  3  5  8  13  21  34

If the list lengths are x and y, the merge takes O(x+y) operations. Crucial: postings sorted by docID.
Sec. 1.3
Intersecting two postings lists (a "merge" algorithm)
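A Python rendering of this merge; the two lists are the Brutus and Caesar postings from the previous slide:

```python
def intersect(p1, p2):
    """Merge two postings lists sorted by docID in O(x + y)."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])      # docID present in both lists
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1                    # advance the smaller docID
        else:
            j += 1
    return answer

brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(intersect(brutus, caesar))      # [2, 8]
```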
The Boolean Retrieval Model & Extended Boolean Models
Boolean queries: Exact match
• The Boolean retrieval model is being able to ask a query that is a Boolean expression:
– Boolean queries are queries using AND, OR and NOT to join query terms
• Views each document as a set of words
• Is precise: document matches condition or not.
– Perhaps the simplest model to build an IR system on
• Primary commercial retrieval tool for 3 decades.
• Many search systems you still use are Boolean:
– Email, library catalog, Mac OS X Spotlight
Sec. 1.3
Example: WestLaw  http://www.westlaw.com/
• Largest commercial (paying subscribers) legal search service (started 1975; ranking added 1992; new federated search added 2010)
• Tens of terabytes of data; ~700,000 users
• Majority of users still use boolean queries
• Example query:
– What is the statute of limitations in cases involving the federal tort claims act?
– LIMIT! /3 STATUTE ACTION /S FEDERAL /2 TORT /3 CLAIM
• /3 = within 3 words, /S = in same sentence
Sec. 1.4
Example: WestLaw  http://www.westlaw.com/
• Another example query:
– Requirements for disabled people to be able to access a workplace
– disabl! /p access! /s work-site work-place (employment /3 place)
• Note that SPACE is disjunction, not conjunction!
• Long, precise queries; proximity operators; incrementally developed; not like web search
• Many professional searchers still like Boolean search
– You know exactly what you are getting
• But that doesn't mean it actually works better…
Sec. 1.4
Boolean queries: More general merges
• Exercise: Adapt the merge for the queries:
  Brutus AND NOT Caesar
  Brutus OR NOT Caesar
• Can we still run through the merge in time O(x+y)? What can we achieve?
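One possible answer for the first query, as a sketch: the same two-pointer walk, keeping a docID from the first list whenever the walk shows it is absent from the second, still O(x+y):

```python
def and_not(p1, p2):
    """Postings for 'p1 AND NOT p2', in one linear walk."""
    answer, i, j = [], 0, 0
    while i < len(p1):
        if j == len(p2) or p1[i] < p2[j]:
            answer.append(p1[i])   # p1[i] cannot occur later in p2
            i += 1
        elif p1[i] == p2[j]:
            i += 1                 # excluded by the NOT term
            j += 1
        else:
            j += 1
    return answer

print(and_not([2, 4, 8, 16, 32], [1, 2, 3, 8]))   # [4, 16, 32]
```

Brutus OR NOT Caesar is different: its result contains every document outside Caesar's postings, so it cannot be read off the two lists alone; enumerating it costs O(N) in the collection size.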
Sec. 1.3
Query optimization
• What is the best order for query processing?
• Consider a query that is an AND of n terms.
• For each of the n terms, get its postings, then AND them together.

Brutus    → 2  4  8  16  32  64  128
Caesar    → 1  2  3  5  8  16  21  34
Calpurnia → 13  16

Query: Brutus AND Calpurnia AND Caesar
Sec. 1.3
Query optimization example
• Process in order of increasing freq:
– start with smallest set, then keep cutting further.
(This is why we kept document freq. in the dictionary.)
Execute the query as (Calpurnia AND Brutus) AND Caesar.

Brutus    → 2  4  8  16  32  64  128
Caesar    → 1  2  3  5  8  16  21  34
Calpurnia → 13  16
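A sketch of the ordering heuristic, reusing intersect() from above; the index maps each term to its (document frequency, postings) pair, which is exactly why the frequency is kept in the dictionary:

```python
def process_and_query(terms, index):
    """AND of n terms, processed in order of increasing doc frequency.
    index: term -> (doc_frequency, postings); intersect() as above."""
    ordered = sorted(terms, key=lambda t: index[t][0])  # rarest first
    result = index[ordered[0]][1]
    for term in ordered[1:]:
        result = intersect(result, index[term][1])
        if not result:
            break           # intersections can only shrink the result
    return result

index = {"Brutus":    (7, [2, 4, 8, 16, 32, 64, 128]),
         "Caesar":    (8, [1, 2, 3, 5, 8, 16, 21, 34]),
         "Calpurnia": (2, [13, 16])}
print(process_and_query(["Brutus", "Calpurnia", "Caesar"], index))  # [16]
```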
More general optimization
• e.g., (madding OR crowd) AND (ignoble OR strife)
• Get doc. freq.'s for all terms.
• Estimate the size of each OR by the sum of its doc. freq.'s (conservative).
• Process in increasing order of OR sizes.
Sec. 1.3
Exercise
• Recommend a query processing order for:
(tangerine OR trees) AND (marmalade OR skies) AND (kaleidoscope OR eyes)
• Which two terms should we process first?

Term           Freq
eyes           213312
kaleidoscope    87009
marmalade      107913
skies          271658
tangerine       46653
trees          316812
Phrase queries and positional indexes
Phrase queries
• We want to be able to answer queries such as "stanford university" – as a phrase
• Thus the sentence "I went to university at Stanford" is not a match.
– The concept of phrase queries has proven easily understood by users; one of the few "advanced search" ideas that works
– Many more queries are implicit phrase queries
• For this, it no longer suffices to store only <term: docs> entries
Sec. 2.4
A first attempt: Biword indexes
• Index every consecutive pair of terms in the text as a phrase
• For example the text "Friends, Romans, Countrymen" would generate the biwords
– friends romans
– romans countrymen
• Each of these biwords is now a dictionary term
• Two-word phrase query-processing is now immediate.
Sec. 2.4.1
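Generating the biwords is a one-liner; a sketch:

```python
def biwords(tokens):
    """Index every consecutive pair of terms as one phrase term."""
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]

print(biwords(["friends", "romans", "countrymen"]))
# ['friends romans', 'romans countrymen']
```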
Longer phrase queries
• Longer phrases can be processed by breaking them down
• stanford university palo alto can be broken into the Boolean query on biwords:
stanford university AND university palo AND palo alto

Without the docs, we cannot verify that the docs matching the above Boolean query do contain the phrase.
Can have false positives!
Sec. 2.4.1
Issues for biword indexes
• False positives, as noted before
• Index blowup due to bigger dictionary
– Infeasible for more than biwords, big even for them
• Biword indexes are not the standard solution (for all biwords) but can be part of a compound strategy
Sec. 2.4.1
Solution 2: Positional indexes
• In the postings, store, for each term, the position(s) in which tokens of it appear:

<term, number of docs containing term;
 doc1: position1, position2 … ;
 doc2: position1, position2 … ;
 etc.>
Sec. 2.4.2
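One way to hold this format in memory is a nested mapping; a sketch, using the "be" entry from the example that follows (993427 is its document count there, and the lists are truncated as on the slide):

```python
# Sketch: term -> (number of docs containing term,
#                  {docID: sorted list of positions}).
positional_index = {
    "be": (993427, {1: [7, 18, 33, 72, 86, 231],
                    2: [3, 149],
                    4: [17, 191, 291, 430, 434],
                    5: [363, 367]}),   # entries truncated as on the slide
}
```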
Positional index example
• For phrase queries, we use a merge algorithm recursively at the document level
• But we now need to deal with more than just equality

<be: 993427; 1: 7, 18, 33, 72, 86, 231; 2: 3, 149; 4: 17, 191, 291, 430, 434; 5: 363, 367, …>

Which of docs 1, 2, 4, 5 could contain "to be or not to be"?
Sec. 2.4.2
Processing a phrase query
• Extract inverted index entries for each distinct term: to, be, or, not.
• Merge their doc:position lists to enumerate all positions with "to be or not to be".
– to:
• 2: 1, 17, 74, 222, 551; 4: 8, 16, 190, 429, 433; 7: 13, 23, 191; ...
– be:
• 1: 17, 19; 4: 17, 191, 291, 430, 434; 5: 14, 19, 101; ...
• Same general method for proximity searches
Sec. 2.4.2
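A sketch of the positional merge for one adjacent pair, using the to/be postings above (doc lists truncated as on the slide); k=1 encodes "be directly follows to":

```python
def positional_intersect(p1, p2, k=1):
    """Docs where some token of term2 sits exactly k positions after
    term1. p1, p2: {docID: sorted positions}. Returns doc -> pairs."""
    matches = {}
    for doc in sorted(p1.keys() & p2.keys()):   # docs with both terms
        pos2 = set(p2[doc])
        hits = [(p, p + k) for p in p1[doc] if p + k in pos2]
        if hits:
            matches[doc] = hits
    return matches

to = {2: [1, 17, 74, 222, 551], 4: [8, 16, 190, 429, 433],
      7: [13, 23, 191]}
be = {1: [17, 19], 4: [17, 191, 291, 430, 434], 5: [14, 19, 101]}
print(positional_intersect(to, be))
# {4: [(16, 17), (190, 191), (429, 430), (433, 434)]}
```

Relaxing the p + k equality test to a window (|pos2 - pos1| <= k) turns the same walk into a proximity search.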
Positional index size
• A positional index expands postings storage substantially
– Even though indices can be compressed
• Nevertheless, a positional index is now standardly used because of the power and usefulness of phrase and proximity queries … whether used explicitly or implicitly in a ranking retrieval system.
Sec. 2.4.2
Positional index size
• Need an entry for each occurrence, not just once per document
• Index size depends on average document size. Why? A term still contributes one posting per document, but its positional entries grow with the number of occurrences, and hence with document length.
– Average web page has <1000 terms
– SEC filings, books, even some epic poems … easily 100,000 terms
• Consider a term with frequency 0.1%:

Document size   Postings   Positional postings
1000            1          1
100,000         1          100
Sec. 2.4.2
Rules of thumb
• A positional index is 2–4 times as large as a non-positional index
• Positional index size is 35–50% of the volume of the original text
– Caveat: all of this holds for "English-like" languages
Sec. 2.4.2
Combination schemes
• These two approaches can be profitably combined
– For particular phrases ("Michael Jackson", "Britney Spears") it is inefficient to keep on merging positional postings lists
• Even more so for phrases like "The Who"
• Williams et al. (2004) evaluate a more sophisticated mixed indexing scheme
– A typical web query mixture was executed in ¼ of the time of using just a positional index
– It required 26% more space than having a positional index alone
Sec. 2.4.3
Structured vs. Unstructured Data

IR vs. databases: Structured vs. unstructured data
• Structured data tends to refer to information in "tables"

Employee   Manager   Salary
Smith      Jones     50000
Chang      Smith     60000
Ivy        Smith     50000

Typically allows numerical range and exact match (for text) queries, e.g., Salary < 60000 AND Manager = Smith.
Unstructured data
• Typically refers to free text
• Allows
– Keyword queries including operators
– More sophisticated "concept" queries, e.g.,
• find all web pages dealing with drug abuse
• Classic model for searching text documents
Semi-structured data
• In fact almost no data is "unstructured"
• E.g., this slide has distinctly identified zones such as the Title and Bullets
• … to say nothing of linguistic structure
• Facilitates "semi-structured" search such as
– Title contains data AND Bullets contain search
• Or even
– Title is about Object Oriented Programming AND Author something like stro*rup
– where * is the wild-card operator