Basic Text Processing - people.cs.georgetown.edu
TRANSCRIPT
Basic Text Processing
Regular Expressions (SLP3 slides, Jurafsky & Martin)
Regular expressions
• A formal language for specifying text strings
• How can we search for any of these?
  • woodchuck
  • woodchucks
  • Woodchuck
  • Woodchucks
Regular Expressions: Disjunctions
• Letters inside square brackets []
• Ranges [A-Z]

Pattern        Matches
[wW]oodchuck   Woodchuck, woodchuck
[1234567890]   Any digit

Pattern   Matches                  Example
[A-Z]     An uppercase letter      "Drenched Blossoms"
[a-z]     A lowercase letter       "my beans were impatient"
[0-9]     A single digit           "Chapter 1: Down the Rabbit Hole"
Regular Expressions: Negation in Disjunction
• Negations [^Ss]
• Caret means negation only when first in []

Pattern   Matches                    Example
[^A-Z]    Not an uppercase letter    "Oyfn pripetchik"
[^Ss]     Neither 'S' nor 's'        "I have no exquisite reason"
[^e^]     Neither e nor ^            "Look here"
a^b       The pattern a caret b      "Look up a^b now"
Regular Expressions: More Disjunction
• Woodchuck is another name for groundhog!
• The pipe | for disjunction

Pattern                     Matches
groundhog|woodchuck         groundhog, woodchuck
yours|mine                  yours, mine
a|b|c (= [abc])
[gG]roundhog|[Ww]oodchuck
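The disjunction patterns above can be tried directly with Python's `re` module; a quick illustrative check:

```python
import re

# Character classes, ranges, negation, and the pipe, as in the tables above.
assert re.search(r"[wW]oodchuck", "Woodchucks are groundhogs").group() == "Woodchuck"
assert re.search(r"[0-9]", "Chapter 1: Down the Rabbit Hole").group() == "1"
# [^A-Z] skips the initial uppercase 'O' and matches the first non-uppercase char.
assert re.search(r"[^A-Z]", "Oyfn pripetchik").group() == "y"
assert re.search(r"groundhog|woodchuck", "the woodchuck dug").group() == "woodchuck"
```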
Regular Expressions: ? * + .

Pattern   Matches
colou?r   Optional previous char       color, colour
oo*h!     0 or more of previous char   oh! ooh! oooh! ooooh!
o+h!      1 or more of previous char   oh! ooh! oooh! ooooh!
baa+                                   baa, baaa, baaaa, baaaaa
beg.n     Any char                     begin, begun, began, beg3n

* and + are called Kleene * and Kleene +, after Stephen C. Kleene.
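The Kleene operators behave as follows in Python's `re` (illustrative checks; `re.fullmatch` requires the whole string to match the pattern):

```python
import re

# ? makes the previous char optional; * means zero or more; + means one or more;
# . matches any single character.
assert re.fullmatch(r"colou?r", "color")
assert re.fullmatch(r"colou?r", "colour")
assert re.fullmatch(r"oo*h!", "oh!")       # zero extra o's is allowed by *
assert re.fullmatch(r"o+h!", "ooooh!")     # + needs at least one o
assert not re.fullmatch(r"o+h!", "h!")
assert re.fullmatch(r"beg.n", "begun")
assert re.fullmatch(r"beg.n", "beg3n")
```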
Regular Expressions: Anchors ^ $

Pattern      Matches
^[A-Z]       Palo Alto
^[^A-Za-z]   1 "Hello"
\.$          The end.
.$           The end? The end!
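The anchors can likewise be checked in Python's `re`:

```python
import re

# ^ anchors at the start of the string, $ at the end; \. is a literal period.
assert re.search(r"^[A-Z]", "Palo Alto")
assert re.search(r"^[^A-Za-z]", '1 "Hello"')
assert re.search(r"\.$", "The end.")
assert not re.search(r"\.$", "The end!")
assert re.search(r".$", "The end!")   # unescaped . matches any final char
```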
Example
• Find me all instances of the word "the" in a text.
  the                         misses capitalized examples
  [tT]he                      incorrectly returns other or theology
  [^a-zA-Z][tT]he[^a-zA-Z]
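The three-step refinement can be checked in Python on a made-up sample sentence (the counts below are for this particular example):

```python
import re

text = "The other day, the theology student read the book."

# Naive pattern: misses capitalized "The" and overmatches inside "other", "theology".
assert len(re.findall(r"the", text)) == 4
# [tT] catches "The", but "other" and "theology" still match.
assert len(re.findall(r"[tT]he", text)) == 5
# Requiring a non-letter on both sides fixes the overmatching, but now
# "The" at the very start of the string is missed again (no char before it).
assert len(re.findall(r"[^a-zA-Z][tT]he[^a-zA-Z]", text)) == 2
```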
Refer to http://people.cs.georgetown.edu/nschneid/cosc272/f17/02_py-notes.html and links on that page for further regex notation, and advice for using regexes in Python 3.
Errors
• The process we just went through was based on fixing two kinds of errors:
  • Matching strings that we should not have matched (there, then, other)
    • False positives (Type I)
  • Not matching things that we should have matched (The)
    • False negatives (Type II)
Errors cont.
• In NLP we are always dealing with these kinds of errors.
• Reducing the error rate for an application often involves two antagonistic efforts:
  • Increasing accuracy or precision (minimizing false positives)
  • Increasing coverage or recall (minimizing false negatives)
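Precision and recall are computed directly from these error counts; a worked example with hypothetical counts for a "the"-matcher:

```python
# Hypothetical counts: tp = correct matches, fp = wrong strings matched
# (e.g. "other"), fn = missed occurrences (e.g. capitalized "The").
tp, fp, fn = 90, 10, 30

precision = tp / (tp + fp)   # minimizing false positives raises precision
recall = tp / (tp + fn)      # minimizing false negatives raises recall

assert precision == 0.9
assert recall == 0.75
```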
Summary
• Regular expressions play a surprisingly large role
  • Sophisticated sequences of regular expressions are often the first model for any text processing task
• For many hard tasks, we use machine learning classifiers
  • But regular expressions are used as features in the classifiers
  • Can be very useful in capturing generalizations
Basic Text Processing
Regular Expressions
Basic Text Processing
Word Tokenization
Text Normalization
• Every NLP task needs to do text normalization:
  1. Segmenting/tokenizing words in running text
  2. Normalizing word formats
  3. Segmenting sentences in running text
How many words?
• "I do uh main- mainly business data processing"
  • Fragments, filled pauses
• Seuss's cat in the hat is different from other cats!
  • Lemma: same stem, part of speech, rough word sense
    • cat and cats = same lemma
  • Wordform: the full inflected surface form
    • cat and cats = different wordforms
How many words?
"they lay back on the San Francisco grass and looked at the stars and their"
• Type: an element of the vocabulary.
• Token: an instance of that type in running text.
• How many?
  • 15 tokens (or 14)
  • 13 types (or 12) (or 11?)
How many words?
N = number of tokens
V = vocabulary = set of types
|V| is the size of the vocabulary

                                  Tokens = N    Types = |V|
Switchboard phone conversations   2.4 million   20 thousand
Shakespeare                       884,000       31 thousand
Google N-grams                    1 trillion    13 million

Church and Gale (1990): |V| > O(N½)
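The type/token distinction can be illustrated on the San Francisco sentence above, using a simple whitespace split as the tokenizer:

```python
text = "they lay back on the San Francisco grass and looked at the stars and their"
tokens = text.split()

N = len(tokens)     # token count: every occurrence counts
V = set(tokens)     # vocabulary: the set of distinct types

assert N == 15
assert len(V) == 13   # "the" and "and" each occur twice
```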
Tokenization
Normal written conventions sometimes do not reflect the "logical" organization of textual symbols. For example, some punctuation marks are written adjacent to the previous or following word, even though they are not part of it. (The details vary according to language and style guide!)

Given a string of raw text, a tokenizer adds logical boundaries between separate word/punctuation tokens (occurrences) not already separated by spaces:

Daniels made several appearances as C-3PO on numerous TV shows and commercials, notably on a Star Wars-themed episode of The Donny and Marie Show in 1977, Disneyland's 35th Anniversary.

Daniels made several appearances as C-3PO on numerous TV shows and commercials , notably on a Star Wars - themed episode of The Donny and Marie Show in 1977 , Disneyland 's 35th Anniversary .

To a large extent, this can be automated by rules. But there are always difficult cases.
Nathan Schneider ANLP (COSC/LING-272) Lecture 2 27
Tokenization in NLTK
>>> nltk.word_tokenize("Daniels made several appearances as C-3PO on numerous TV shows and commercials, notably on a Star Wars-themed episode of The Donny and Marie Show in 1977, Disneyland's 35th Anniversary.")
['Daniels', 'made', 'several', 'appearances', 'as', 'C-3PO', 'on', 'numerous', 'TV', 'shows', 'and', 'commercials', ',', 'notably', 'on', 'a', 'Star', 'Wars-themed', 'episode', 'of', 'The', 'Donny', 'and', 'Marie', 'Show', 'in', '1977', ',', 'Disneyland', "'s", '35th', 'Anniversary', '.']
Tokenization in NLTK
English tokenization conventions vary somewhat, e.g., with respect to:
• clitics (contracted forms): 's, n't, 're, etc.
• hyphens in compounds like president-elect (fun fact: this convention changed between versions of the Penn Treebank!)
Issues in Tokenization
• Finland's capital → Finland? Finlands? Finland's?
• what're, I'm, isn't → what are, I am, is not
• Hewlett-Packard → Hewlett Packard?
• state-of-the-art → state of the art?
• Lowercase → lower-case, lowercase, lower case?
• San Francisco → one token or two?
• m.p.h., PhD. → ??
Tokenization: language issues
• French
  • L'ensemble → one token or two?
    • L? L'? Le?
  • Want l'ensemble to match with un ensemble
• German noun compounds are not segmented
  • Lebensversicherungsgesellschaftsangestellter
  • 'life insurance company employee'
  • German information retrieval needs compound splitter
Tokenization: language issues
• Chinese and Japanese have no spaces between words:
  • 莎拉波娃现在居住在美国东南部的佛罗里达。
  • 莎拉波娃 现在 居住 在 美国 东南部 的 佛罗里达
  • Sharapova now lives in US southeastern Florida
• Further complicated in Japanese, with multiple alphabets intermingled
  • Dates/amounts in multiple formats
    フォーチュン500社は情報不足のため時間あた$500K(約6,000万円)
    (Katakana, Hiragana, Kanji, Romaji)
• End-user can express query entirely in hiragana!
Word Tokenization in Chinese
• Also called Word Segmentation
• Chinese words are composed of characters
  • Characters are generally 1 syllable and 1 morpheme.
  • Average word is 2.4 characters long.
• Standard baseline segmentation algorithm:
  • Maximum Matching (also called Greedy)
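Maximum Matching scans left to right, greedily taking the longest dictionary word at each position. A minimal sketch with a toy dictionary (real segmenters use large lexicons):

```python
def max_match(text, dictionary):
    """Greedy (maximum-matching) segmentation: at each position take the
    longest dictionary word that matches; fall back to a single character."""
    words = []
    i = 0
    while i < len(text):
        for j in range(len(text), i, -1):        # try the longest span first
            if text[i:j] in dictionary or j == i + 1:
                words.append(text[i:j])
                i = j
                break
    return words

# Toy dictionary built from the slide's example sentence (an assumption here).
d = {"莎拉波娃", "现在", "居住", "在", "美国", "东南部", "的", "佛罗里达"}
assert max_match("莎拉波娃现在居住在美国东南部的佛罗里达", d) == \
    ["莎拉波娃", "现在", "居住", "在", "美国", "东南部", "的", "佛罗里达"]
```

Greedy matching works well for Chinese (short average word length) but can fail when a long dictionary entry swallows the start of the next word; statistical segmenters address this.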
Basic Text Processing
Word Tokenization
Basic Text Processing
Word Normalization and Stemming
Normalization
• Need to "normalize" terms
  • Information Retrieval: indexed text & query terms must have same form.
    • We want to match U.S.A. and USA
• We implicitly define equivalence classes of terms
  • e.g., deleting periods in a term
• Alternative: asymmetric expansion:
  • Enter: window    Search: window, windows
  • Enter: windows   Search: Windows, windows, window
  • Enter: Windows   Search: Windows
• Potentially more powerful, but less efficient
Case folding
• Applications like IR: reduce all letters to lowercase
  • Since users tend to use lowercase
  • Possible exception: uppercase in mid-sentence?
    • e.g., General Motors
    • Fed vs. fed
    • SAIL vs. sail
• For sentiment analysis, MT, information extraction
  • Case is helpful (US versus us is important)
Lemmatization
• Reduce inflections or variant forms to base form
  • am, are, is → be
  • car, cars, car's, cars' → car
  • the boy's cars are different colors → the boy car be different color
• Lemmatization: have to find correct dictionary headword form
• Machine translation
  • Spanish quiero ('I want') and quieres ('you want') have the same lemma as querer 'want'
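At its simplest, lemmatization is a lookup from inflected form to dictionary headword. A minimal lookup-based sketch with a hypothetical toy table (real lemmatizers, e.g. NLTK's WordNetLemmatizer, combine a large dictionary with morphological rules):

```python
# Hypothetical toy lemma table covering the slide's examples.
LEMMAS = {
    "am": "be", "are": "be", "is": "be",
    "cars": "car", "car's": "car", "cars'": "car",
    "quiero": "querer", "quieres": "querer",
}

def lemmatize(word):
    # Fall back to the lowercased word when it is not in the table.
    return LEMMAS.get(word.lower(), word.lower())

assert lemmatize("am") == "be"
assert lemmatize("cars") == "car"
assert lemmatize("quieres") == "querer"
```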
Morphology
• Morphemes: the small meaningful units that make up words
  • Stems: the core meaning-bearing units
  • Affixes: bits and pieces that adhere to stems
    • Often with grammatical functions
Stemming
• Reduce terms to their stems in information retrieval
• Stemming is crude chopping of affixes
  • language dependent
  • e.g., automate(s), automatic, automation all reduced to automat.

For example, "compressed and compression are both accepted as equivalent to compress" stems to:
for exampl compress and compress ar both accept as equival to compress
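Crude affix chopping can be sketched as ordered suffix-stripping rules. The rules below are illustrative only, not the Porter stemmer's actual rule set, but they reproduce the automat/compress examples from the slide:

```python
# Illustrative suffix list, tried longest-first; NOT Porter's real rules.
SUFFIXES = ["sses", "ing", "ion", "ic", "ed", "es", "s"]

def stem(word):
    """Strip the first matching suffix, keeping a stem of at least 3 chars."""
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[: -len(suf)]
    return word

assert stem("automation") == "automat"
assert stem("automatic") == "automat"
assert stem("compression") == "compress"
assert stem("compressed") == "compress"
```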
Basic Text Processing
Word Normalization and Stemming
Basic Text Processing
Sentence Segmentation and Decision Trees
Sentence Segmentation
• !, ? are relatively unambiguous
• Period "." is quite ambiguous:
  • Sentence boundary
  • Abbreviations like Inc. or Dr.
  • Numbers like .02% or 4.3
• Build a binary classifier:
  • Looks at a "."
  • Decides EndOfSentence / NotEndOfSentence
  • Classifiers: hand-written rules, regular expressions, or machine learning
Determining if a word is end-of-sentence: a Decision Tree
More sophisticated decision tree features
• Case of word with ".": Upper, Lower, Cap, Number
• Case of word after ".": Upper, Lower, Cap, Number
• Numeric features
  • Length of word with "."
  • Probability(word with "." occurs at end-of-sentence)
  • Probability(word after "." occurs at beginning-of-sentence)
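The case and length features above can be extracted with a few string operations; a sketch (the function name is hypothetical, and the probability features would additionally need corpus counts):

```python
def period_features(word, next_word):
    """Extract simple features for classifying a '.' as sentence-final or not."""
    return {
        "word_is_upper": word.isupper(),              # e.g. acronyms like "U.S."
        "word_is_capitalized": word[:1].isupper(),
        "next_is_capitalized": next_word[:1].isupper(),
        "word_is_number": word.replace(".", "").isdigit(),  # e.g. "4.3"
        "word_length": len(word),
    }

feats = period_features("Inc.", "The")
assert feats["word_is_capitalized"] and feats["next_is_capitalized"]
assert not feats["word_is_number"]
```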
Implementing Decision Trees
• A decision tree is just an if-then-else statement
• The interesting research is choosing the features
• Setting up the structure is often too hard to do by hand
  • Hand-building only possible for very simple features, domains
  • For numeric features, it's too hard to pick each threshold
• Instead, structure usually learned by machine learning from a training corpus
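The "just an if-then-else statement" point can be made concrete with a tiny hand-built tree for the period classifier (illustrative rules and abbreviation list, not a trained model):

```python
def is_end_of_sentence(word, next_word):
    """Hand-built decision tree sketch for disambiguating a final '.'."""
    if not word.endswith("."):
        return False
    if word in {"Inc.", "Dr.", "Mr.", "etc."}:   # known abbreviations
        return False
    if next_word[:1].islower():                   # lowercase continuation
        return False
    return True

assert is_end_of_sentence("end.", "The") is True
assert is_end_of_sentence("Dr.", "Smith") is False
```

Even this toy version shows why hand-building does not scale: every new abbreviation or numeric pattern needs another branch, which is why the structure is usually learned from a training corpus.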
Decision Trees and other classifiers
• We can think of the questions in a decision tree as features that could be exploited by any kind of classifier:
  • Logistic regression
  • SVM
  • Neural Nets
  • etc.
Basic Text Processing
Sentence Segmentation and Decision Trees