Basic Text Processing (people.cs.georgetown.edu)

Post on 10-Dec-2021



Basic Text Processing

Regular Expressions (SLP3 slides, Jurafsky & Martin)

Regular expressions

• A formal language for specifying text strings
• How can we search for any of these?
  • woodchuck
  • woodchucks
  • Woodchuck
  • Woodchucks


Regular Expressions: Disjunctions

• Letters inside square brackets []
• Ranges [A-Z]

Pattern        Matches                  Example
[wW]oodchuck   Woodchuck, woodchuck
[1234567890]   Any digit
[A-Z]          An uppercase letter      Drenched Blossoms
[a-z]          A lowercase letter       my beans were impatient
[0-9]          A single digit           Chapter 1: Down the Rabbit Hole
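The bracket patterns in the table can be tried directly with Python's re module (a minimal sketch; the patterns come from the table, the test strings are illustrative):

```python
import re

# [wW]oodchuck matches either capitalization of the first letter
assert re.findall(r'[wW]oodchuck', 'Woodchuck and woodchuck') == ['Woodchuck', 'woodchuck']

# the range [0-9] matches any single digit
assert re.findall(r'[0-9]', 'Chapter 1: Down the Rabbit Hole') == ['1']

# [A-Z] matches one uppercase letter, [a-z] one lowercase letter
assert re.search(r'[A-Z]', 'Drenched Blossoms').group() == 'D'
assert re.search(r'[a-z]', 'my beans were impatient').group() == 'm'
```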

Regular Expressions: Negation in Disjunction

• Negations [^Ss]
• Caret means negation only when first in []

Pattern   Matches                   Example
[^A-Z]    Not an uppercase letter   Oyfn pripetchik
[^Ss]     Neither 'S' nor 's'       I have no exquisite reason"
[^e^]     Neither e nor ^           Look here
a^b       The pattern "a caret b"   Look up a^b now
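The negation rules can be checked the same way (a sketch; note that in Python's re a caret outside brackets is an anchor, so it must be escaped to match a literal ^):

```python
import re

# ^ negates only when it is the first character inside the brackets
assert re.search(r'[^A-Z]', 'Oyfn pripetchik').group() == 'y'   # 'O' is uppercase, skipped
assert re.findall(r'[^Ss]', 'Ss!') == ['!']

# not first in the brackets: ^ is a literal character
assert re.findall(r'[^e^]', 'ee^x') == ['x']

# outside brackets, Python treats ^ as an anchor, so escape it for a literal match
assert re.search(r'a\^b', 'Look up a^b now').group() == 'a^b'
```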


Regular Expressions: More Disjunction

• Woodchucks is another name for groundhog!
• The pipe | for disjunction

Pattern                     Matches
groundhog|woodchuck
yours|mine                  yours, mine
a|b|c = [abc]
[gG]roundhog|[Ww]oodchuck

Regular Expressions: ? * + .

Stephen C. Kleene

Pattern   Matches
colou?r   Optional previous char: color, colour
oo*h!     0 or more of previous char: oh! ooh! oooh! ooooh!
o+h!      1 or more of previous char: oh! ooh! oooh! ooooh!
baa+      baa baaa baaaa baaaaa
beg.n     begin begun begun beg3n

(* and + are called Kleene *, Kleene +)
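The operators from the last two tables behave as follows in Python (a sketch; the test strings are illustrative):

```python
import re

# ? makes the previous character optional
assert re.findall(r'colou?r', 'color colour') == ['color', 'colour']

# * = 0 or more of the previous char, + = 1 or more
assert re.fullmatch(r'oo*h!', 'oooh!') is not None
assert re.fullmatch(r'o+h!', 'h!') is None        # + requires at least one 'o'

# . matches any single character
assert re.findall(r'beg.n', 'begin begun beg3n') == ['begin', 'begun', 'beg3n']

# | is disjunction over whole alternatives
assert re.findall(r'groundhog|woodchuck', 'a groundhog is a woodchuck') == ['groundhog', 'woodchuck']
```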


Regular Expressions: Anchors ^ $

Pattern      Matches
^[A-Z]       Palo Alto
^[^A-Za-z]   1 "Hello"
\.$          The end.
.$           The end? The end!
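The anchor patterns from the table, tried in Python (a sketch):

```python
import re

# ^ anchors the match at the start of the string, $ at the end
assert re.search(r'^[A-Z]', 'Palo Alto').group() == 'P'
assert re.search(r'^[^A-Za-z]', '1 "Hello"').group() == '1'

# \. is a literal period; an unescaped . matches any character
assert re.search(r'\.$', 'The end.') is not None
assert re.search(r'\.$', 'The end!') is None
assert re.search(r'.$', 'The end!').group() == '!'
```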

Example

• Find me all instances of the word "the" in a text.
    the
  Misses capitalized examples
    [tT]he
  Incorrectly returns other or theology
    [^a-zA-Z][tT]he[^a-zA-Z]
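The refinement steps above can be traced in Python (a sketch; the test sentence is made up for illustration):

```python
import re

text = 'The other one, then the theology student.'

# plain 'the' misses the capitalized 'The', and over-matches inside words
assert len(re.findall(r'the', text)) == 4           # in 'other', 'then', 'theology' too
# [tT]he now also matches 'The', but still over-matches
assert len(re.findall(r'[tT]he', text)) == 5
# requiring a non-letter on each side removes 'other', 'then', 'theology'...
assert re.findall(r'[^a-zA-Z][tT]he[^a-zA-Z]', text) == [' the ']
# ...but it still misses 'The' at the very start of the string
```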


Refer to http://people.cs.georgetown.edu/nschneid/cosc272/f17/02_py-notes.html and links on that page for further regex notation, and advice for using regexes in Python 3.

Errors

• The process we just went through was based on fixing two kinds of errors
  • Matching strings that we should not have matched (there, then, other)
    • False positives (Type I)
  • Not matching things that we should have matched (The)
    • False negatives (Type II)


Errors cont.

• In NLP we are always dealing with these kinds of errors.
• Reducing the error rate for an application often involves two antagonistic efforts:
  • Increasing accuracy or precision (minimizing false positives)
  • Increasing coverage or recall (minimizing false negatives)

Summary

• Regular expressions play a surprisingly large role
• Sophisticated sequences of regular expressions are often the first model for any text processing task
• For many hard tasks, we use machine learning classifiers
  • But regular expressions are used as features in the classifiers
  • Can be very useful in capturing generalizations


Basic Text Processing

Regular Expressions

Basic Text Processing

Word tokenization


Text Normalization

• Every NLP task needs to do text normalization:
  1. Segmenting/tokenizing words in running text
  2. Normalizing word formats
  3. Segmenting sentences in running text

How many words?

• I do uh main- mainly business data processing
  • Fragments, filled pauses
• Seuss's cat in the hat is different from other cats!
  • Lemma: same stem, part of speech, rough word sense
    • cat and cats = same lemma
  • Wordform: the full inflected surface form
    • cat and cats = different wordforms


How many words?

they lay back on the San Francisco grass and looked at the stars and their

• Type: an element of the vocabulary.
• Token: an instance of that type in running text.
• How many?
  • 15 tokens (or 14)
  • 13 types (or 12) (or 11?)
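The token/type distinction can be made concrete in a few lines of Python (a sketch; splitting on whitespace is itself a crude tokenizer, which is why the counts are debatable, e.g., San Francisco as one token gives 14):

```python
text = 'they lay back on the San Francisco grass and looked at the stars and their'

tokens = text.split()    # crude whitespace tokenization
types = set(tokens)      # the vocabulary V

assert len(tokens) == 15          # N = 15 tokens
assert len(types) == 13           # |V| = 13: 'the' and 'and' each occur twice
```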

How many words?

N = number of tokens
V = vocabulary = set of types
|V| is the size of the vocabulary

                                  Tokens = N    Types = |V|
Switchboard phone conversations   2.4 million   20 thousand
Shakespeare                       884,000       31 thousand
Google N-grams                    1 trillion    13 million

Church and Gale (1990): |V| > O(N½)


Tokenization

Normal written conventions sometimes do not reflect the "logical" organization of textual symbols. For example, some punctuation marks are written adjacent to the previous or following word, even though they are not part of it. (The details vary according to language and style guide!)

Given a string of raw text, a tokenizer adds logical boundaries between separate word/punctuation tokens (occurrences) not already separated by spaces:

Daniels made several appearances as C-3PO on numerous TV shows and commercials, notably on a Star Wars-themed episode of The Donny and Marie Show in 1977, Disneyland's 35th Anniversary.

Daniels made several appearances as C-3PO on numerous TV shows and commercials , notably on a Star Wars - themed episode of The Donny and Marie Show in 1977 , Disneyland 's 35th Anniversary .

To a large extent, this can be automated by rules. But there are always difficult cases.

Nathan Schneider ANLP (COSC/LING-272) Lecture 2 27


Tokenization in NLTK

>>> nltk.word_tokenize("Daniels made several appearances as C-3PO on numerous TV shows and commercials, notably on a Star Wars-themed episode of The Donny and Marie Show in 1977, Disneyland's 35th Anniversary.")
['Daniels', 'made', 'several', 'appearances', 'as', 'C-3PO', 'on', 'numerous', 'TV', 'shows', 'and', 'commercials', ',', 'notably', 'on', 'a', 'Star', 'Wars-themed', 'episode', 'of', 'The', 'Donny', 'and', 'Marie', 'Show', 'in', '1977', ',', 'Disneyland', "'s", '35th', 'Anniversary', '.']



English tokenization conventions vary somewhat, e.g., with respect to:

• clitics (contracted forms): 's, n't, 're, etc.
• hyphens in compounds like president-elect (fun fact: this convention changed between versions of the Penn Treebank!)


Issues in Tokenization

• Finland's capital → Finland? Finlands? Finland's?
• what're, I'm, isn't → What are, I am, is not
• Hewlett-Packard → Hewlett Packard?
• state-of-the-art → state of the art?
• Lowercase → lower-case lowercase lower case?
• San Francisco → one token or two?
• m.p.h., PhD. → ??
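A rule-based tokenizer of the kind described above can be sketched with a single regex (an illustration only, far cruder than nltk.word_tokenize; the clitic and hyphen rules here are assumptions, not Penn Treebank conventions):

```python
import re

# order matters: peel a word off before "n't", then clitics, then
# words (keeping internal hyphens), then single punctuation marks
TOKEN_RE = re.compile(r"\w+?(?=n't)|n't|'s|\w+(?:-\w+)*|[^\w\s]")

def tokenize(text):
    """Split raw text into word/clitic/punctuation tokens."""
    return TOKEN_RE.findall(text)

assert tokenize("Finland's capital") == ['Finland', "'s", 'capital']
assert tokenize("state-of-the-art design") == ['state-of-the-art', 'design']
assert tokenize("isn't it?") == ['is', "n't", 'it', '?']
```

Each hard case in the bullet list forces a design choice here: this sketch keeps hyphenated compounds whole and splits off 's and n't, but other conventions are equally defensible.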


Tokenization: language issues

• French
  • L'ensemble → one token or two?
  • L? L'? Le?
  • Want l'ensemble to match with un ensemble
• German noun compounds are not segmented
  • Lebensversicherungsgesellschaftsangestellter
  • 'life insurance company employee'
  • German information retrieval needs compound splitter

Tokenization: language issues

• Chinese and Japanese: no spaces between words:
  • 莎拉波娃现在居住在美国东南部的佛罗里达。
  • 莎拉波娃 现在 居住 在 美国 东南部 的 佛罗里达
  • Sharapova now lives in US southeastern Florida
• Further complicated in Japanese, with multiple alphabets intermingled
  • Dates/amounts in multiple formats

フォーチュン500社は情報不足のため時間あた$500K(約6,000万円)
Katakana   Hiragana   Kanji   Romaji

End-user can express query entirely in hiragana!


Word Tokenization in Chinese

• Also called Word Segmentation
• Chinese words are composed of characters
  • Characters are generally 1 syllable and 1 morpheme.
  • Average word is 2.4 characters long.
• Standard baseline segmentation algorithm:
  • Maximum Matching (also called Greedy)
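The Maximum Matching baseline fits in a few lines (a sketch; the toy dictionary below is an assumption for illustration):

```python
def max_match(text, dictionary):
    """Greedy left-to-right segmentation: at each position, take the
    longest dictionary word starting there; fall back to one character."""
    words = []
    i = 0
    while i < len(text):
        match = None
        for j in range(len(text), i, -1):   # try the longest candidate first
            if text[i:j] in dictionary:
                match = text[i:j]
                break
        if match is None:
            match = text[i]                 # unknown single character
        words.append(match)
        i += len(match)
    return words

# toy dictionary (an assumption): note the greedy rule prefers 北京烤鸭 over 北京 + 烤鸭
d = {'他', '特别', '喜欢', '北京烤鸭', '北京', '烤鸭'}
assert max_match('他特别喜欢北京烤鸭', d) == ['他', '特别', '喜欢', '北京烤鸭']
```

Greedy matching works reasonably well for Chinese because words are short; applied to unsegmented English it fails badly, which is why it is only a baseline.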

Basic Text Processing

Word tokenization


Basic Text Processing

Word Normalization and Stemming

Normalization

• Need to "normalize" terms
  • Information Retrieval: indexed text & query terms must have same form.
  • We want to match U.S.A. and USA
• We implicitly define equivalence classes of terms
  • e.g., deleting periods in a term
• Alternative: asymmetric expansion:
  • Enter: window    Search: window, windows
  • Enter: windows   Search: Windows, windows, window
  • Enter: Windows   Search: Windows
• Potentially more powerful, but less efficient
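The period-deletion equivalence class amounts to a normalization function applied to both indexed terms and query terms (a minimal sketch; the single rule here is an assumption, real IR systems use many such rules):

```python
def normalize(term):
    """Map a term to its equivalence-class representative
    by deleting internal periods (one possible rule)."""
    return term.replace('.', '')

# the index term and the query term now fall into the same class
assert normalize('U.S.A.') == normalize('USA') == 'USA'
```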


Case folding

• Applications like IR: reduce all letters to lowercase
  • Since users tend to use lowercase
  • Possible exception: uppercase in mid-sentence?
    • e.g., General Motors
    • Fed vs. fed
    • SAIL vs. sail
• For sentiment analysis, MT, information extraction:
  • Case is helpful (US versus us is important)

Lemmatization

• Reduce inflections or variant forms to base form
  • am, are, is → be
  • car, cars, car's, cars' → car
  • the boy's cars are different colors → the boy car be different color
• Lemmatization: have to find correct dictionary headword form
• Machine translation
  • Spanish quiero ('I want'), quieres ('you want') same lemma as querer 'want'


Morphology

• Morphemes:
  • The small meaningful units that make up words
• Stems: The core meaning-bearing units
• Affixes: Bits and pieces that adhere to stems
  • Often with grammatical functions

Stemming

• Reduce terms to their stems in information retrieval
• Stemming is crude chopping of affixes
  • language dependent
  • e.g., automate(s), automatic, automation all reduced to automat.

for example compressed and compression are both accepted as equivalent to compress.

for exampl compress and compress ar both accept as equival to compress
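Crude affix chopping of this kind can be sketched with a handful of ordered suffix rules (a sketch only; the rule list is an assumption, and the real Porter stemmer uses many more rules with conditions on what remains after stripping):

```python
def stem(word):
    """Strip the first matching suffix, leaving at least two characters
    of stem (a toy rule list, not the Porter stemmer)."""
    for suffix in ('ion', 'ent', 'ing', 'ed', 'ic', 'es', 'e', 's'):
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[:-len(suffix)]
    return word

# reproduces the slide's examples
assert stem('compressed') == stem('compression') == 'compress'
assert [stem(w) for w in ('automate', 'automates', 'automatic', 'automation')] == ['automat'] * 4
assert stem('equivalent') == 'equival' and stem('example') == 'exampl'
```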


Basic Text Processing

Word Normalization and Stemming

Basic Text Processing

Sentence Segmentation and Decision Trees


Sentence Segmentation

• !, ? are relatively unambiguous
• Period "." is quite ambiguous
  • Sentence boundary
  • Abbreviations like Inc. or Dr.
  • Numbers like .02% or 4.3
• Build a binary classifier
  • Looks at a "."
  • Decides EndOfSentence/NotEndOfSentence
  • Classifiers: hand-written rules, regular expressions, or machine learning

Determining if a word is end-of-sentence: a Decision Tree


More sophisticated decision tree features

• Case of word with ".": Upper, Lower, Cap, Number
• Case of word after ".": Upper, Lower, Cap, Number
• Numeric features
  • Length of word with "."
  • Probability(word with "." occurs at end-of-sentence)
  • Probability(word after "." occurs at beginning-of-sentence)

Implementing Decision Trees

• A decision tree is just an if-then-else statement
• The interesting research is choosing the features
• Setting up the structure is often too hard to do by hand
  • Hand-building only possible for very simple features, domains
  • For numeric features, it's too hard to pick each threshold
• Instead, structure usually learned by machine learning from a training corpus
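A hand-built decision tree over the features above is literally an if-then-else cascade (a sketch; the abbreviation list and the rules are assumptions, exactly the kind of simple hand-building the slide says is only feasible for simple features):

```python
ABBREVIATIONS = {'Dr.', 'Mr.', 'Mrs.', 'Inc.', 'etc.'}   # assumed list

def is_end_of_sentence(word, next_word):
    """Decide whether the period ending `word` marks a sentence boundary."""
    if not word.endswith('.'):
        return False                      # no period to classify
    if word in ABBREVIATIONS:
        return False                      # 'Dr.' etc.: period is part of the word
    if not next_word[:1].isupper():
        return False                      # next word not capitalized: unlikely boundary
    return True

assert is_end_of_sentence('Dr.', 'Smith') is False
assert is_end_of_sentence('street.', 'The') is True
assert is_end_of_sentence('U.S.', 'economy') is False
```

Note how each `if` corresponds to one question in the decision tree; learning the tree means choosing which questions to ask, in what order, and with what thresholds.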


Decision Trees and other classifiers

• We can think of the questions in a decision tree
  • As features that could be exploited by any kind of classifier
    • Logistic regression
    • SVM
    • Neural Nets
    • etc.

Basic Text Processing

Sentence Segmentation and Decision Trees
