text processing - ir.cis.udel.edu
3/17/09
Text Processing
CISC489/689-010, Lecture #3
Monday, Feb. 16
Ben Carterette
Indexing
• An index is a list of things (keys) with pointers to other things (items).
  – Keywords → catalog numbers (shelves).
  – Concepts → page numbers.
  – Terms → documents.
• Need for indexes:
  – Ease of use.
  – Speed.
  – Scalability.
Manual vs. Automatic Indexing
• Manual:
  – An "expert" assigns keys to each item.
  – Example: card catalog.
• Automatic:
  – Keys automatically identified and assigned.
  – Example: Google.
• Automatic as good as manual for most purposes.
Text Processing
• First step in automatic indexing.
• Converting documents into index terms.
• Terms are not just words.
  – Not all words are of equal value in a search.
  – Sometimes not clear where words begin and end.
    • Especially when not space-separated, e.g. Chinese, Korean.
  – Matching the exact words typed by the user doesn't work very well in terms of effectiveness.
Text Processing Steps
• For each document:
  – Parse it to locate the parts that are important.
  – Segment and tokenize the text in the important parts to get words.
  – Remove stop words.
  – Stem words to common roots.
• Advanced processing may include phrases, entity tagging, link-graph features, and more.
Parsing
• Some parts of a document are more important than others.
• Document parser recognizes structure using markup such as HTML tags.
  – Headers, anchor text, bolded text are likely to be important.
  – JavaScript, style information, navigation links less likely to be important.
  – Metadata can also be important.
Example Wikipedia Page
Wikipedia Markup
<title>Tropical fish</title> <text>{{Unreferenced|date=July 2008}} {{Original research|date=July 2008}} '''Tropical fish''' include [[fish]] found in [[Tropics|tropical]] environments around the world, including both [[fresh water|freshwater]] and [[sea water|salt water]] species. [[Fishkeeping|Fishkeepers]] often use the term ''tropical fish'' to refer only those requiring fresh water, with saltwater tropical fish referred to as ''[[list of marine aquarium fish species|marine fish]]''.
…
Wikipedia HTML
Document Parsing
• HTML pages organize into trees.
<HTML>
  <HEAD>
    <TITLE> Tropical fish
    <META>
  <BODY>
    <H1> Tropical fish
    <P>
      <B> Tropical fish
      <A> fish
      <A> tropical
      include   found in   environments around the world
• Nodes contain blocks of text.
End Result of Parsing
• Blocks of text from important parts of page.
  – Tropical fish include fish found in tropical environments around the world, including both freshwater and salt water species. Fishkeepers often use the term "tropical fish" to refer only those requiring fresh water, with saltwater tropical fish referred to as "marine fish".
• Next step: segmenting and tokenizing.
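A minimal sketch of how the Wikipedia markup shown earlier could be reduced to a plain block of text. The class name, the patterns, and the choice to handle only non-nested templates, links, and quote runs are assumptions made for illustration; real Wikipedia markup has many more constructs.

```java
import java.util.regex.Pattern;

class WikiMarkupStripper {
    // {{...}} templates without nested braces (assumption: nesting is ignored)
    private static final Pattern TEMPLATE = Pattern.compile("\\{\\{[^{}]*\\}\\}");
    // [[target|label]] keeps only the label
    private static final Pattern PIPED_LINK =
            Pattern.compile("\\[\\[[^\\]|]*\\|([^\\]]*)\\]\\]");
    // [[target]] keeps the target text
    private static final Pattern PLAIN_LINK = Pattern.compile("\\[\\[([^\\]|]*)\\]\\]");
    // Runs of two or more apostrophes mark italic or bold text
    private static final Pattern QUOTE_RUN = Pattern.compile("'{2,}");

    static String strip(String markup) {
        String text = TEMPLATE.matcher(markup).replaceAll(" ");
        text = PIPED_LINK.matcher(text).replaceAll("$1");
        text = PLAIN_LINK.matcher(text).replaceAll("$1");
        text = QUOTE_RUN.matcher(text).replaceAll("");
        // Collapse whitespace left behind by the removed markup
        return text.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        System.out.println(strip("{{Unreferenced|date=July 2008}} '''Tropical fish''' "
                + "include [[fish]] found in [[Tropics|tropical]] environments"));
    }
}
```

The link label (not the link target) is kept because the label is what appears in the rendered page, which matches the block of text shown on the slide.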
Tokenizing
• Forming words from sequence of characters in blocks of text.
• Surprisingly complex in English, can be harder in other languages.
• Early IR systems:
  – Any sequence of alphanumeric characters of length 3 or more.
  – Terminated by a space or other special character.
  – Upper-case changed to lower-case.
Tokenizing
• Example:
  – "Bigcorp's 2007 bi-annual report showed profits rose 10%." becomes
  – "bigcorp 2007 annual report showed profits rose"
• Too simple for search applications or even large-scale experiments
• Why? Too much information lost
  – Small decisions in tokenizing can have major impact on effectiveness of some queries
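The early-IR rule (alphanumeric runs of length 3 or more, lower-cased) can be sketched in a few lines. The class and method names are illustrative, and restricting "alphanumeric" to ASCII letters and digits is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

class EarlyTokenizer {
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        // Any run of non-alphanumeric characters terminates a token
        for (String piece : text.split("[^A-Za-z0-9]+")) {
            // Keep only sequences of length 3 or more, lower-cased
            if (piece.length() >= 3) {
                tokens.add(piece.toLowerCase());
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(
                tokenize("Bigcorp's 2007 bi-annual report showed profits rose 10%."));
    }
}
```

On the example sentence, the pieces "s", "bi", and "10" fall under the length threshold, which is exactly the information loss the slide complains about.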
Tokenizing Problems
• Small words can be important in some queries, usually in combinations
  • xp, ma, pm, ben e king, el paso, master p, gm, j lo, world war II
• Both hyphenated and non-hyphenated forms of many words are common
  – Sometimes hyphen is not needed
    • e-bay, wal-mart, active-x, cd-rom, t-shirts
  – At other times, hyphens should be considered either as part of the word or a word separator
    • winston-salem, mazda rx-7, e-cards, pre-diabetes, t-mobile, spanish-speaking
Tokenizing Problems
• Special characters are an important part of tags, URLs, code in documents
• Capitalized words can have different meaning from lower case words
  – Bush, Apple
• Apostrophes can be a part of a word, a part of a possessive, or just a mistake
  – rosie o'donnell, can't, don't, 80's, 1890's, men's straw hats, master's degree, england's ten largest cities, shriner's
Tokenizing Problems
• Numbers can be important, including decimals
  – nokia 3250, top 10 courses, united 93, quicktime 6.5 pro, 92.3 the beat, 288358
• Periods can occur in numbers, abbreviations, URLs, ends of sentences, and other situations
  – I.B.M., Ph.D., cis.udel.edu
• Note: tokenizing steps for queries must be identical to steps for documents
Tokenizing Process
• Assume we have used the parser to find blocks of important text.
• A word may be any sequence of alphanumeric characters terminated by a space or special character.
  – everything converted to lower case.
  – everything indexed.
• Defer complex decisions to other components
  – example: 92.3 → 92 3 but search finds documents with 92 and 3 adjacent
  – incorporate some rules to reduce dependence on query transformation components
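A sketch of the "index everything, defer decisions" approach: the same tokenizer is applied to documents and queries, and a query such as 92.3 matches only where its tokens are adjacent. The class name and the adjacency helper are assumptions; a real search engine would answer this from positions stored in the index rather than by scanning token lists.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class SimpleTokenizer {
    // Every alphanumeric run becomes a lower-cased token; nothing is dropped
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.toLowerCase().split("[^a-z0-9]+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    // True if the query tokens occur consecutively, in order, in the document
    static boolean containsAdjacent(List<String> docTokens, List<String> queryTokens) {
        return !queryTokens.isEmpty()
                && Collections.indexOfSubList(docTokens, queryTokens) >= 0;
    }

    public static void main(String[] args) {
        List<String> doc = tokenize("Listen to 92.3 The Beat");
        System.out.println(doc);
        System.out.println(containsAdjacent(doc, tokenize("92.3")));
    }
}
```

Because both sides go through `tokenize`, the query "92.3" and the document text "92.3" produce the same token pair, which is the point of the note that query and document tokenizing must be identical.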
End Result of Tokenization
• List of words in blocks of text.
  – tropical fish include fish found in tropical environments around the world including both freshwater and salt water species fishkeepers often use the term tropical fish to refer only those requiring fresh water with saltwater tropical fish referred to as marine fish
• Next step: stopping.
• But first: text statistics.
Text Statistics
• Huge variety of words used in text but
• Many statistical characteristics of word occurrences are predictable
  – e.g., distribution of word counts
• Retrieval models and ranking algorithms depend heavily on statistical properties of words
  – e.g., important words occur often in documents but are not high frequency in collection
Zipf's Law
• Distribution of word frequencies is very skewed
  – a few words occur very often, many words hardly ever occur
  – e.g., two most common words ("the", "of") make up about 10% of all word occurrences in text documents
• Zipf's "law":
  – observation that rank ($r$) of a word times its frequency ($f$) is approximately a constant ($k$)
    • assuming words are ranked in order of decreasing frequency
  – i.e., $r \cdot f \approx k$ or $r \cdot P_r \approx c$, where $P_r$ is probability of word occurrence and $c \approx 0.1$ for English
Zipf's Law
Wikipedia Statistics (wiki000 subset)
  Total documents                  5,001
  Total word occurrences           22,545,922
  Vocabulary size                  348,436
  Words occurring > 1000 times     2,751
  Words occurring once             163,404

  Word         Freq    r         Pr (%)       r·Pr
  politician   5096    510       0.023        0.116
  contractor   100     14,852    4.4·10^-4    0.066
  kickboxer    10      56,125    4.4·10^-5    0.025
  comedian     1       185,035   4.4·10^-6    0.008
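The last column of the table can be recomputed from the other figures: $P_r$ is the word's frequency divided by the total number of word occurrences, and Zipf's law says $r \cdot P_r$ should stay near a constant. The sketch below uses the counts from the table; the class and method names are illustrative, and small differences from the printed values are possible because the slide rounds.

```java
class ZipfCheck {
    // Total word occurrences in the wiki000 subset, from the table above
    static final long TOTAL_OCCURRENCES = 22_545_922L;

    // Probability of a word occurrence: frequency / total occurrences
    static double probability(long freq) {
        return (double) freq / TOTAL_OCCURRENCES;
    }

    // The quantity Zipf's law predicts to be roughly constant
    static double rankTimesProbability(long rank, long freq) {
        return rank * probability(freq);
    }

    public static void main(String[] args) {
        String[] words = {"politician", "contractor", "kickboxer", "comedian"};
        long[] freqs = {5096, 100, 10, 1};
        long[] ranks = {510, 14852, 56125, 185035};
        for (int i = 0; i < words.length; i++) {
            System.out.printf("%-12s Pr = %.2e  r*Pr = %.3f%n", words[i],
                    probability(freqs[i]), rankTimesProbability(ranks[i], freqs[i]));
        }
    }
}
```

The product drifts well below 0.1 for the rarest words, which shows that the law fits the middle of the ranking better than its tail.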
Top 50 Words from wiki000 Subset
Zipf's Law for wiki000 Subset (plot: probability vs. rank)
Zipf's Law
• What is the proportion of words with a given frequency?
  – Word that occurs $n$ times has rank $r_n = k/n$
  – Number of words with frequency $n$ is
    • $r_n - r_{n+1} = k/n - k/(n+1) = k/(n(n+1))$
  – Proportion found by dividing by total number of words = highest rank = $k$
  – So, proportion with frequency $n$ is $1/(n(n+1))$
Zipf's Law
• Example word frequency ranking

  Rank   Word        Freq
  4999   objective   494
  5000   albany      494
  5001   defend      494
  5002   appeals     493
  5003   125         493
  5004   lasting     493
  5005   png         493

• To compute number of words with frequency 493
  – rank of "png" minus the rank of "defend"
  – 5005 − 5001 = 4
Example
• Proportions of words occurring $n$ times in 5,001 Wikipedia documents
• Vocabulary size is 348,436.

  Num. occurrences (n)   Predicted proportion 1/(n(n+1))   Actual proportion   Actual number of words
  1                      .500                              .469                163,404
  2                      .167                              .151                52,672
  3                      .083                              .070                24,272
  4                      .050                              .045                15,685
  5                      .033                              .030                10,437
  6                      .024                              .022                7,832
  7                      .018                              .017                5,962
  8                      .014                              .014                4,890
  9                      .011                              .011                3,886
  10                     .009                              .009                3,291
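The two proportion columns can be reproduced directly: the prediction is $1/(n(n+1))$, and the actual proportion is the number of words with frequency $n$ divided by the vocabulary size. This is a small sketch using the counts from the table; the class and method names are illustrative.

```java
class ZipfProportions {
    // Vocabulary size of the 5,001-document Wikipedia subset
    static final long VOCABULARY_SIZE = 348_436L;

    // Proportion of words predicted to occur exactly n times
    static double predicted(int n) {
        return 1.0 / (n * (n + 1.0));
    }

    // Observed proportion: words with frequency n over all distinct words
    static double actual(long wordsWithFrequencyN, long vocabularySize) {
        return (double) wordsWithFrequencyN / vocabularySize;
    }

    public static void main(String[] args) {
        long[] counts = {163404, 52672, 24272, 15685, 10437, 7832, 5962, 4890, 3886, 3291};
        for (int n = 1; n <= counts.length; n++) {
            System.out.printf("%2d  predicted %.3f  actual %.3f%n",
                    n, predicted(n), actual(counts[n - 1], VOCABULARY_SIZE));
        }
    }
}
```

Since $1/(n(n+1)) = 1/n - 1/(n+1)$, the predicted proportions for $n = 1, \dots, N$ telescope to $1 - 1/(N+1)$, so they sum to 1 over all $n$, as proportions should.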
Vocabulary Growth
• As corpus grows, so does vocabulary size
  – Fewer new words when corpus is already large
• Observed relationship (Heaps' Law):
  $v = k \cdot n^{\beta}$
  where $v$ is vocabulary size (number of unique words), $n$ is the number of words in corpus, and $k$, $\beta$ are parameters that vary for each corpus (typical values given are $10 \le k \le 100$ and $\beta \approx 0.5$)
wiki000 Subset Example (plot: vocabulary size vs. words in collection; fitted curve $v \approx 18.61 \cdot n^{0.5819}$)
Heaps' Law Predictions
• Predictions for TREC collections are accurate for large numbers of words
  – e.g., first 22,545,922 words of wiki000 scanned
  – prediction is 353,587 unique words
  – actual number is 348,436
• Predictions for small numbers of words (i.e. < 1000) are much worse
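The prediction above follows from plugging the fitted wiki000 parameters ($k = 18.61$, $\beta = 0.5819$) into $v = k \cdot n^{\beta}$. A minimal sketch, with illustrative class and method names:

```java
class HeapsLaw {
    // Heaps' law: predicted vocabulary size after seeing n word occurrences
    static double predictVocabulary(double k, double beta, long n) {
        return k * Math.pow(n, beta);
    }

    public static void main(String[] args) {
        long wordsScanned = 22_545_922L;   // words of wiki000 scanned
        long actualVocabulary = 348_436L;  // observed number of unique words
        double predicted = predictVocabulary(18.61, 0.5819, wordsScanned);
        System.out.printf("predicted %.0f, actual %d, relative error %.1f%%%n",
                predicted, actualVocabulary,
                100.0 * (predicted - actualVocabulary) / actualVocabulary);
    }
}
```

Because the parameters are rounded to four significant digits, the computed value may differ from the slide's figure in the last few digits.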
Heaps' Law Predictions
• Heaps' Law works with very large corpora
  – new words occurring even after seeing 30 million!
• New words come from a variety of sources
  • spelling errors, invented words (e.g. product, company names), code, other languages, email addresses, etc.
• Search engines must deal with these large and growing vocabularies
Stopping
• Function words (determiners, prepositions) have little meaning on their own
• High occurrence frequencies
  – Top 6 words: the, of, and, in, to, a
• Treated as stopwords (i.e. removed)
  – reduce index space, improve response time, improve effectiveness
• Can be important in combinations
  – e.g., "to be or not to be"
Stopping
• Keep track of all very common words in a stopwords list.
• During text processing, ignore any word on the list.
• Stopword list can be created from high-frequency words or based on a standard list
• Lists are customized for applications, domains, and even parts of documents
  – e.g., "click" is a good stopword for anchor text
Stopping
• When storage space is not a concern, it can be better to not stop.
  – Queries are less restricted.
  – Remove stopwords at query time unless user says to include them.
• Google does not stop.
  – "to be or not to be" returns results.
  – +the returns results (over 14 billion).
End Result of Stopping
• List of words minus those on the stop list.
  – tropical fish include fish found tropical environments around world including both freshwater salt water species fishkeepers often use term tropical fish refer only those requiring fresh water saltwater tropical fish referred marine fish
• Next step: stemming.
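Stopping itself is a set-membership test per token. The sketch below uses a hypothetical eight-word stop list (the top six words named earlier plus "as" and "with"); this list is an assumption chosen for the example, not the list used to produce the slide. A real list would be built from collection statistics or taken from a standard list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class StopWordFilter {
    // Hypothetical stop list for illustration only
    static final Set<String> STOP_WORDS = new HashSet<>(
            Arrays.asList("the", "of", "and", "in", "to", "a", "as", "with"));

    // Returns the tokens that are not on the stop list, in original order
    static List<String> removeStopWords(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!STOP_WORDS.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList(
                "tropical fish include fish found in tropical environments around the world"
                        .split(" "));
        System.out.println(removeStopWords(tokens));
    }
}
```

A hash set keeps the lookup cost independent of the list length, which matters when the filter runs once per word occurrence.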
Stemming
• Many morphological variations of words
  – inflectional (plurals, tenses)
  – derivational (making verbs nouns etc.)
• In most cases, these have the same or very similar meanings
• Stemmers attempt to reduce morphological variations of words to a common stem
  – usually involves removing suffixes
• Can be done at indexing time or as part of query processing (like stopwords)
Stemming
• Generally a small but significant effectiveness improvement
  – can be crucial for some languages
  – e.g., 5-10% improvement for English, up to 50% in Arabic
Words with the Arabic root ktb
Stemming
• Two basic types
  – Dictionary-based: uses lists of related words
  – Algorithmic: uses program to determine related words
• Algorithmic stemmers
  – suffix-s: remove 's' endings assuming plural
    • e.g., cats → cat, lakes → lake
    • Many false negatives: supplies → supplie
    • Some false positives: ups → up
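The suffix-s stemmer is small enough to write out in full, which makes its errors easy to see. The class name and the guard against stemming a one-letter word are assumptions.

```java
class SuffixSStemmer {
    // Remove a trailing 's', assuming it marks a plural
    static String stem(String word) {
        if (word.length() > 1 && word.endsWith("s")) {
            return word.substring(0, word.length() - 1);
        }
        return word;
    }

    public static void main(String[] args) {
        for (String w : new String[] {"cats", "lakes", "supplies", "ups"}) {
            System.out.println(w + " -> " + stem(w));
        }
    }
}
```

Here "supplies" becomes "supplie" and so fails to match "supply" (a false negative), while "ups" becomes "up" and is conflated with an unrelated word (a false positive).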
Porter Stemmer
• Algorithmic stemmer used in IR experiments since the 70s
• Consists of a series of rules designed to strip the longest possible suffix at each step
• Provably effective
• Produces stems not words
• Makes a number of errors and difficult to modify
Porter Stemmer
• Example step (1 of 5)
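One concrete instance of such a rule set is Step 1a of Porter's published algorithm, sketched below. It may differ from the step shown in lecture, and it covers only this one sub-step, not the full stemmer. The class and method names are illustrative.

```java
class PorterStep1a {
    // Rules are tried longest suffix first, so "sses" is tested before "s"
    static String apply(String word) {
        if (word.endsWith("sses")) {
            return word.substring(0, word.length() - 2); // sses -> ss
        }
        if (word.endsWith("ies")) {
            return word.substring(0, word.length() - 2); // ies -> i
        }
        if (word.endsWith("ss")) {
            return word;                                 // ss -> ss (unchanged)
        }
        if (word.endsWith("s")) {
            return word.substring(0, word.length() - 1); // s -> (removed)
        }
        return word;
    }

    public static void main(String[] args) {
        for (String w : new String[] {"caresses", "ponies", "caress", "cats"}) {
            System.out.println(w + " -> " + apply(w));
        }
    }
}
```

The output "poni" for "ponies" illustrates why the stemmer is said to produce stems rather than words.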
Porter Stemmer
• Porter2 stemmer addresses some of these issues
• Approach has been used with other languages
Krovetz Stemmer
• Hybrid algorithmic-dictionary
  – Word checked in dictionary
    • If present, either left alone or replaced with "exception"
    • If not present, word is checked for suffixes that could be removed
    • After removal, dictionary is checked again
• Produces words not stems
• Comparable effectiveness
• Lower false positive rate, somewhat higher false negative
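The dictionary-then-suffix control flow can be sketched as below. This is not the Krovetz stemmer itself: the tiny dictionary, the exception table, and the suffix list are all hypothetical stand-ins, and the real stemmer uses a full dictionary and more careful suffix handling.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class DictionaryStemmerSketch {
    // Hypothetical data for illustration only
    static final Set<String> DICTIONARY = new HashSet<>(
            Arrays.asList("fish", "species", "environment", "marine", "tropical"));
    static final Map<String, String> EXCEPTIONS = new HashMap<>();
    static final String[] SUFFIXES = {"ing", "es", "ed", "s"};
    static {
        EXCEPTIONS.put("feet", "foot");
    }

    static String stem(String word) {
        // Known irregular form: replace with its listed "exception"
        if (EXCEPTIONS.containsKey(word)) {
            return EXCEPTIONS.get(word);
        }
        // Word already in the dictionary: leave it alone
        if (DICTIONARY.contains(word)) {
            return word;
        }
        // Otherwise try removing a suffix and check the dictionary again
        for (String suffix : SUFFIXES) {
            if (word.length() > suffix.length() && word.endsWith(suffix)) {
                String candidate = word.substring(0, word.length() - suffix.length());
                if (DICTIONARY.contains(candidate)) {
                    return candidate;
                }
            }
        }
        return word; // no dictionary form found, so keep the word as is
    }

    public static void main(String[] args) {
        for (String w : new String[] {"species", "environments", "fishes", "feet"}) {
            System.out.println(w + " -> " + stem(w));
        }
    }
}
```

Because a suffix is removed only when the result is a dictionary word, the output is always a real word, at the cost of leaving unknown words unstemmed (the higher false negative rate noted above).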
Stemmer Comparison
End Result of Stemming
• List of stemmed terms:
  – tropic fish include fish found tropic environ around world include both freshwat salt water speci fishkeep often use term tropic fish refer onli those requir fresh water saltwat tropic fish refer marin fish
  – (from Porter2 stemmer)
• Next step: advanced processing, or indexing.
Martin Hall, 49, head of public policy and external affairs at the London Stock Exchange, is to leave at the end of June.
… The departure of Hall, who had
been in the running to be head of corporate affairs at the BBC, appears to have been prompted by the decision of the new chief executive, Michael Lawrence, to split Hall’s job in two and take the public policy element under his own wing.
<person id=pe1>Martin Hall</person>, 49, <sense num=2>head</sense> of <ow1>public policy</ow1> and external affairs at the <corp id=co1>London Stock Exchange</corp>, is to <syn grp=1>leave</syn> at the end of June.
… The <syn grp=1>departure</syn> of
<person id=pe1>Hall</person>, <ref to=pe1>who</ref> had been in the running to be head of corporate affairs at the <corp id=co2>BBC</corp>, appears to have been prompted by the decision of the new chief executive, <person id=pe2>Michael Lawrence</person>, to split <person id=pe1>Hall’s</person> job in two and take the public policy element under <ref to=pe1>his</ref> own wing.
Advanced Text Processing
• Part-of-speech tagging.
• Sense disambiguation.
• Synonym classification.
• Named entity tagging.
• Phrase identification.
• Referent resolution.
• Sentence segmentation.
• Translation.
• Speech recognition.
Text Processing Errors
• All text processing is error-prone.
  – Design decisions produce segmentation errors, stopping errors, stemming errors.
  – False positives and false negatives.
  – More advanced methods → more difficult processing → more errors.
• Does the benefit outweigh the cost?
  – Segmentation & stemming: definitely.
  – POS tagging, NE tagging: depends on domain.
  – Synonym classes: maybe not.
End Result of Text Processing
<title>Tropical fish</title> <text>{{Unreferenced|date=July 2008}} {{Original research|date=July 2008}} '''Tropical fish''' include [[fish]] found in [[Tropics|tropical]] environments around the world, including both [[fresh water|freshwater]] and [[sea water|salt water]] species. [[Fishkeeping|Fishkeepers]] often use the term ''tropical fish'' to refer only those requiring fresh water, with saltwater tropical fish referred to as ''[[list of marine aquarium fish species|marine fish]]''.
• Metadata:
  – Title: Tropical fish
• Important fields:
  – Links: fish tropic freshwat salt water fishkeep marin fish
• Body:
  – tropic fish include fish found tropic environ around world include both freshwat salt water speci fishkeep often use term tropic fish refer onli those requir fresh water saltwat tropic fish refer marin fish
Course Project
• Phase I, worksheet 1.
  – Write a text processing module.
  – Parse Wikipedia pages, tokenize, stop, and stem.
  – Answer questions about Wikipedia data: how big is vocabulary, how many word occurrences are there, etc.
• Due next Wednesday.
  – Please start ASAP!
Expectations
• Read Wikipedia pages off disk.
• Identify parts of them that do not need to be indexed.
• Convert the rest into a list of words.
• Drop stopwords, stem remaining words to terms.
• Keep track of the number of times each term appears, how many documents it appears in.
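The last expectation asks for two different counts per term. One way to keep both is sketched below, assuming each document has already been reduced to a list of terms; the class, field, and method names are hypothetical and not part of the assignment.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class TermStatistics {
    // Total occurrences of each term across all documents
    final Map<String, Integer> collectionFrequency = new HashMap<>();
    // Number of documents in which each term appears at least once
    final Map<String, Integer> documentFrequency = new HashMap<>();

    void addDocument(List<String> terms) {
        Set<String> seenInThisDocument = new HashSet<>();
        for (String term : terms) {
            collectionFrequency.merge(term, 1, Integer::sum);
            // Set.add returns true only the first time a term is seen here,
            // so each document contributes at most 1 to documentFrequency
            if (seenInThisDocument.add(term)) {
                documentFrequency.merge(term, 1, Integer::sum);
            }
        }
    }

    public static void main(String[] args) {
        TermStatistics stats = new TermStatistics();
        stats.addDocument(Arrays.asList("tropic", "fish", "fish"));
        stats.addDocument(Arrays.asList("marin", "fish"));
        System.out.println("occurrences: " + stats.collectionFrequency);
        System.out.println("documents:   " + stats.documentFrequency);
    }
}
```

The vocabulary size is then `collectionFrequency.size()`, and the total number of word occurrences is the sum of its values.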
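Pseudo Java
import java.io.*;
import java.util.*;
…
// Map each term to the number of times it has been seen so far.
// (Generic type arguments must be object types, so Integer rather than int.)
Map<String, Integer> termCounts = new HashMap<>();

File doc = new File(filename);
Scanner docScanner = new Scanner(doc);  // may throw FileNotFoundException
while (docScanner.hasNextLine()) {
    List<String> terms = processLine(docScanner.nextLine());
    for (String currentTerm : terms) {
        // merge() stores 1 for a new term and adds 1 to an existing count,
        // which avoids a null result from get() on an unseen term.
        termCounts.merge(currentTerm, 1, Integer::sum);
    }
}
docScanner.close();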
public List<String> processLine(String line) {
    List<String> terms = new ArrayList<>();  // List is an interface; use ArrayList
    Scanner lineScanner = new Scanner(line);
    // Split on runs of whitespace ("\\s*" also matches the empty string
    // and would break the line into single characters).
    lineScanner.useDelimiter("\\s+");
    while (lineScanner.hasNext()) {
        String word = lineScanner.next();
        /* check if word is appropriate for indexing or if it marks the start of a block to ignore */
        if (word.indexOf("{{") >= 0) {
            /* ignore words until closing the block with a }} */
            …
        }
        … /* other conditions */
        /* strip non-alphanumeric characters and lower-case */
        word = word.replaceAll("[^a-zA-Z0-9]", "");
        word = word.toLowerCase();
        /* skip words that became empty; check if word is in the stop list */
        if (!word.isEmpty() && !isStopWord(word)) {
            word = stemmer.stem(word);  /* stem word */
            terms.add(word);            /* append; set() fails on an empty list */
        }
    }
    lineScanner.close();
    return terms;
}