text mining challenges and solutions in big · pdf file1/16/17 1 dr. normand péladeau...

11

Upload: doandan

Post on 06-Feb-2018

221 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: TEXT MINING CHALLENGES AND SOLUTIONS IN BIG · PDF file1/16/17 1 Dr. Normand Péladeau President& CEO ProvalisResearchCorp. peladeau@provalisresearch.com TEXT MINING CHALLENGES AND

1/16/17

1

Dr.NormandPéladeauPresident &CEO

Provalis Research [email protected]

TEXTMININGCHALLENGESANDSOLUTIONSINBIGDATA

Dr.DerrickL.CogburnHICSSGlobalVirtualTeamsMini-TrackCo-Chair

HICSSTextAnalyticsMini-TrackCo-ChairAssociateProfessor,SchoolofInternationalService

ExecutiveDirector,InstituteonDisabilityandPublicPolicyCOTELCO:TheCollaborationLaboratory

[email protected]

@derrickcogburn

Objectives

Attheendofthisworkshop,participantsshouldbeableto:

1. Understandthemainchallengestextanalystsarefacing.

2. Identifyvarioustextanalysisstrategiesandtechniquestodealwiththosechallenges.

3. Recognizetheirrespectivestrengthsandweaknesses.

4. Identifyvariousexploratorytextminingtechniques.

5. Applydictionaryconstructionandvalidationprinciples

Andifenoughtime6. Understandsomeofthebasicfeaturesofautomaticdocument

classificationtechniques

• PracticalTextMiningandStatisticalAnalysisforNon-StructuredTextDataApplications,GaryMineretal,AcademicPress/Elsevier(2012)(AvailableforKindle)

• RforEveryone:AdvancedAnalyticsandGraphics.JaredP.Lander.Addison-Wesley,2014.(AvailableforKindle)

• AnIntroductiontoDataScience,JeffreyStanton(2013).(FreeiBookorPDF)http://jsresearch.net/wiki/projects/teachdatascience

• Hadoop:TheEnginethatDrivesBigData,LarsNielsen.ExecutiveSummary(AvailableforKindle)NewStreetCommunications(2013)

• TextMiningforQualitativeDataAnalysisintheSocialSciences.Gregor Wiedemann (2016).Springer.

RecommendedTexts

• Bits=BinaryDigit(0,1)• Nibble=4bits• Byte=8bits(256combinations)

– (NB:KBv.Kb)storagev.transmission

Meaning 2nd Digit 1st Digit

No 0 0

Maybe 0 1

Probably 1 0

Definitely 1 1

Source:JeffStanton,IntroductiontoDataScience

Atwobitcodebookforresponsetoaninvitation

Understandingthe“data”inBigData

• Comparisonoffilesizes:– Kilobyte (KB)=1,024bytes(2-3paragraphsofplaintext)– Megabyte (MB)=1,048,576bytesor1,024Kilobytes(873pagesofplaintext)– Gigabyte (GB)1,073,741,824(230)bytes.1,024Megabytes,or1,048,576Kilobytes

(894,784pagesofplaintext)– Terabyte (TB)1,099,511,627,776(240)bytes,1,024Gigabytes,or1,048,576Megabytes

(916,259,689pagesofplaintext)– Petabyte (PB)1,125,899,906,842,624(250)bytes,1,024Terabytes,1,048,576Gigabytes,

or1,073,741,824Megabytes(938,249,922,368pagesofplaintext)– Exabyte (EB)1,152,921,504,606,846,976(260)bytes,1,024Petabytes,1,048,576

Terabytes,1,073,741,824Gigabytes,or1,099,511,627,776Megabytes(960,767,920,505,705pagesofplaintext)

– Zettabyte (ZB)1,180,591,620,717,411,303,424(270)bytes,1,024Exabytes,1,048,576Petabytes,1,073,741,824Terabytes,1,099,511,627,776Gigabytes,or1,125,899,910,000,000Megabytes(983,826,350,597,842,752pagesofplaintext)

– Yottabyte (YB)1,208,925,819,614,629,174,706,176(280)bytes,1,024Zettabytes,1,048,576Exabytes,1,073,741,824Petabytes,1,099,511,627,776Terabytes,1,125,899,910,000,000Gigabytes,or1,152,921,500,000,000,000Megabytes(1,007,438,183,012,190,978,921pagesofplaintext)

Source:http://www.computerhope.com/issues/chspace.htm

Understandingthe“big”inBigData

• “BigData”isarelativeterm– itmeansdifferentthingstodifferentpeople/disciplines:

• Whenwetalkof“Big”data,wemean“big”lessinabsolutetermsandmoreintermsrelativetothecomprehensivenatureofthedata.

• 75-80%oftheworld’savailabledataisunstructuredtext(unstructuredinformationgrowingat15timesstructured)

• “Inthepast50years,theNewYorkTimesproduced3billionwords”and“Twitterusersproduce8billionwords– everysingleday”(Kalev Leetaru,UniversityofIllinois,andKaisler,Armour,Espinosa,andMoney,2014)

• Inadditiontotext(websites,blogs,socialmedia,emailarchives,annualreports,meetingtranscripts,publishedarticles–newspapersandjournals)thereareimage,video,audio,GPS,RFID,andothertypesofBigData

Definingthe“big”inBigData

Page 2: TEXT MINING CHALLENGES AND SOLUTIONS IN BIG · PDF file1/16/17 1 Dr. Normand Péladeau President& CEO ProvalisResearchCorp. peladeau@provalisresearch.com TEXT MINING CHALLENGES AND

1/16/17

2

• The“ThreeVs Model”ofBigData

• Volume =theamountofavailabledata• Velocity =speedatwhichdataarrives/decays• Variety =differenttypesofdata

– Plus:

• Veracity =accuracyofthedata• Variability =differinginterpretationsofthedata• Value =relativeimportanceofthedata

Source:DougLaney,BusinessAnalyst,Gartner

Definingthe“big”inBigData• ToolsandTechniquesforscrapingsites– Software

• Sitesucker*• PageSucker• WebGrabber• WebDumper

– HandCoding• Perl• Python• Ruby

• Downloadingemaillistservearchives– Mbox format,Gmail,etc.

• TwitterAPIs– dev.twitter.com

• ProvalisSocialMediaScraper– WebCollector

• Newspapersandpublishedarticles– e-libraryresources,Lexis-Nexis,etc. *

AcquiringTextData

TextAnalyticsApplications• SentimentAnalysis(socialmedia)• VoiceoftheCustomer(emails,chat,callcentertranscripts)• Productimprovement(warrantyclaims)• CompetitiveIntelligence(patents,websites)• Riskmanagement(incidentormaintenancereports)• Frauddetection(insuranceclaims)• Reputationmanagement(news,blogs,socialmedia)• Scientometrics studies(journalarticles,titles&abstracts)• Crimeanalysis(narratives,computerforensics,testimonies)• Surveyanalysis(open-endedquestions)• Financialprediction(earningsreleases,news,pressreleases)• Surveillancesystem(communication,medicalreports)•Manymore...

TextMiningintheWorldDataSciences

FOURAPPROCHESTOTEXTANALYSIS

1. ComputerAssistedQualitativeAnalysis

2. Exploratorytextmining

3. QuantitativeContentAnalysis

4. AutomaticDocumentClassification

TextAnalysisLandscape

• Each text analysis method has its own strengths and weaknesses.

• No single method is appropriate for all text analysis tasks.

• A single text analysis task often benefit from combining several methods.

OurProposal

Page 3: TEXT MINING CHALLENGES AND SOLUTIONS IN BIG · PDF file1/16/17 1 Dr. Normand Péladeau President& CEO ProvalisResearchCorp. peladeau@provalisresearch.com TEXT MINING CHALLENGES AND

1/16/17

3

Toolswewilluse

ContentAnalysis&TextMining

QualitativeAnalysis&MixedMethods

SomeothertoolsavailableCommercialtools

IBMTextModeler forText,IBMWatsonSASTextMiner,Clarabridge,Lexalytics,AlchemyAPI,Attensity,Enkata,OdinText,etc.

Open-sourcetoolsTextminingmodulesRprogrammingmodules(R/tm)Gensim,Mallet,Quanteda,RapidMiner,Gate,KNIME

NLPLibrariesStanfordNLP,NaturalLanguageTookit (NLTK)OpenCalaisApacheOpenNLP,etc.

http://www.kdnuggets.com/software/text.html

Iamonly demon-trating theannota-tion feature,so youdon’t need toreadthis!

Still reading

Computer AssistedQualitativeAnalysis Computer AssistedQualitativeAnalysis

Necessity is the mother of inventionLaziness

THREE MAJOR OBSTACLES

1) Very large number of word forms

2) Polymorphy of languageOne idea ® multiple forms

3) Polysemy of wordsOne word ® many ideas

TextAnalyticsChallenge

Page 4: TEXT MINING CHALLENGES AND SOLUTIONS IN BIG · PDF file1/16/17 1 Dr. Normand Péladeau President& CEO ProvalisResearchCorp. peladeau@provalisresearch.com TEXT MINING CHALLENGES AND

1/16/17

4

THREE MAJOR OBSTACLES

1) Very large number of word forms

2) Polymorphy of languageOne idea ® multiple forms

3) Polysemy of wordsOne word ® many ideas

TextAnalyticsChallenge Challenge#1– Quantity

38,996commentsabouthotels• 2,1millionwords(tokens)• 20,116termsorwordforms(types)

1,8millioncourseevaluations• 35millionswords(tokens)• 78,159termsorwordforms(types)

TextAnalyticsChallenge TextAnalyticsChallenge

TextAnalyticsChallenge

• Theorderofthewordsinthedocumentdoesnotmatter

• Whilea“bigassumption”textminingexpertshavefoundthattheycanstilldifferentiatebetweensemanticconceptsbyusingallthewordsinthedocuments

• Donotworkinallsituationsandsomeinformationextractiontasksandnaturallanguageprocessingreliesheavilyonthewordsthemselves(e.g.partofspeechtagging)andtheorderofthewords(precedingandfollowing)

• Specializedalgorithmsareusedinthesecases

The “bagofwords”assumption

Page 5: TEXT MINING CHALLENGES AND SOLUTIONS IN BIG · PDF file1/16/17 1 Dr. Normand Péladeau President& CEO ProvalisResearchCorp. peladeau@provalisresearch.com TEXT MINING CHALLENGES AND

1/16/17

5

Challenge#1– SolutionsLinguisticPre-processing

• Removalofstopwords

• Stemming

• Lemmatization

Challenge#1– StopWords

RemovalofStopWords• Wordsthatareeitherinsignificant(i.e.,articles,prepositions)ortoocommon

• Examples:“the”,“and”,“or”,“a”,“of”,“to”,“at”,“is”,“it”,“have”,“who”,etc.

• Caution:usewithcare“IT”as“informationtechnology”

“TheWho”,“takethat”,“amust”

Negation:not,no,never,seldom,no,etc

“but”,“however”,“otherwise”

Challenge#1– StemmingStemming- Removalofcommonprefixesandsuffixesto

obtainawordstem

Example:prefix– stem– suffixun– avail– able

Issue:Stemmingerrors

universal,university,universe->univers

designate,design->design

paste,past->past

political,polite->polit

several,severance->sever

Challenge#1– Lemmatization

Lemmatization:Reducinginflectedformsofwordstotheircanonicalform.

Examples: walk,walks,walked,waking->walkam,are,is->be

Twoforms:1.Linguistic(veryslowbutmoreprecise)2.Statistical(fastbutlessaccurate)

Issue- Somelossinsemanticprecision• Differentusesofsingularvs pluralforms

• Differentusesofverbtenses

Challenge#1– SolutionsLinguisticPre-processing

• Removalofstopwords

• Stemming

• Lemmatization

Statisticaltools• Frequencyselection

• Datareductiontechniques(HCA,PCA,FA)

• Exploratorydataanalysis(ex.CA).

• MachineLearning

TheStatisticsofTextDistributionofwords:Zipf distribution

Page 6: TEXT MINING CHALLENGES AND SOLUTIONS IN BIG · PDF file1/16/17 1 Dr. Normand Péladeau President& CEO ProvalisResearchCorp. peladeau@provalisresearch.com TEXT MINING CHALLENGES AND

1/16/17

6

TheStatisticsofText

38,988commentsabouthotels2.1Mwords(20,114differentterms)

MOST FREQUENTTERMS

PERCENTAGE OFTERMS

PERCENTAGE OFWORDS

49 0.24% 50%

300 1.5% 76%

500 2.5% 83%

1000 5.0% 90%

TextAnalyticsChallenge

TFxIDF – Termfrequencyxinversedocumentfrequency

Heuristictechniqueforselectingwordsthatareimportantinacorpus

Principles:Ifawordappearsfrequentlyinadocument,it’simportantIfawordappearsinmanydocuments,itislessimportant

Basicformula:ft,d xlog(N/nt)

PROS

• Identificationoftopics&structureoftopics

• Maybeusedtoreducedimensionality

• Tendstogroupsynonyms(polymorphy)

CONS• Doesnotdealadequatelywithpolysemy ofwords

• Nosinglebestsolution

(morelater)

HierarchicalClustering

PROS

• Fastidentificationoftopics

• Reducedimensionality

• Dealpartiallywithsynonymy&polysemy

CONS

• Nosinglebestsolution

(morelater)

TopicModeling(LSA,pLSA,LDA,PAM,etc.)

PROS• Veryfast• Verylittleefforts• Inductive

CONS• Comparabilityofresults• Imprecisequantification• Insensitivetolowfrequencyevents• Sensitivetostructuredtextelements• Inductive

Text MiningApproach

THREE MAJOR OBSTACLES

1) Very large number of word forms

TextAnalyticsChallenge

Page 7: TEXT MINING CHALLENGES AND SOLUTIONS IN BIG · PDF file1/16/17 1 Dr. Normand Péladeau President& CEO ProvalisResearchCorp. peladeau@provalisresearch.com TEXT MINING CHALLENGES AND

1/16/17

7

THREE MAJOR OBSTACLES

1) Very large number of word forms

2) Polymorphy of languageOne idea ® multiple forms

TextAnalyticsChallenge Contentanalysismethod

Contentanalysismethod Contentanalysismethod

CreationofadictionaryCustom Dictionary forCourseEvaluation CreationofadictionaryCustom Dictionary forSDGsandPWDs

• SDG Goal 4: Education = 2– 4.5 By 2030, eliminate gender disparities

in education and ensure equal access to all levels of education and vocational training for the vulnerable, including persons with disabilities, indigenous peoples and children in vulnerable situations

– 4.a. Build and upgrade education facilities that are child, disability and gender sensitive and provide safe, nonviolent, inclusive and effective learning environments for all

• Goal4.5:– GenderDisparity

• Keywords• Keyphrases

– EqualAccesstoEducation

– EqualAccesstoVocationalTraining

• Goal4.a– Facilities– SafeEnvironment– InclusiveEnvironment– EffectiveLearning

Page 8: TEXT MINING CHALLENGES AND SOLUTIONS IN BIG · PDF file1/16/17 1 Dr. Normand Péladeau President& CEO ProvalisResearchCorp. peladeau@provalisresearch.com TEXT MINING CHALLENGES AND

1/16/17

8

CreationofadictionaryCustom Dictionary forSDGsandPWDs CreationofadictionaryCustom Dictionary forSDGsandPWDs

CreationofadictionaryCustom Dictionary forSDGsandPWDs ContentanalysismethodPROS

• Canpotentiallymeasuremoreaccurately• Canbefocused(multi-focus)• Allowsfullautomation• Allowscomparison(overtime– acrosstextcollections)• Allowsmeasurementsoflatentdimensions• Publiclyorcommerciallyavailabledictionaries• Deductiveapproach

PSYCHOMETRICMEASUREMENT

• LinguisticInquiryandWordCount(LIWC)- Pennebaker

• RegressiveImageryDictionary(RID)– Martindale

• CommunicationVaguenessDictionary– Hiller

• BrandPersonalityDictionary- Opoku

SOCIO-POLITICALMEASUREMENT

• DICTION- Hart• Lasswell ValueDictionary- Lasswell• GeneralInquirerHarvardIV- Stone

MeasureLatentDimensions MeasureLatentDimensions

COMMUNICATIONVAGUENESSDICTIONARY

Page 9: TEXT MINING CHALLENGES AND SOLUTIONS IN BIG · PDF file1/16/17 1 Dr. Normand Péladeau President& CEO ProvalisResearchCorp. peladeau@provalisresearch.com TEXT MINING CHALLENGES AND

1/16/17

9

ContentanalysismethodPROS

• Canpotentiallymeasuremoreaccurately• Canbefocused(multi-focus)• Allowsfullautomation• Allowscomparison(overtime– acrosstextcollection)• Allowsmeasurementsoflatentdimensions• Publiclyorcommerciallyavailabledictionaries• Deductive

CONS• Timerequiredforconstruction&validation• Improperuseofexistingdictionaries

Toolsfordictionaryconstruction

- Clustering&topicmodeling

- Frequencylistofwords

- Phraseextraction

- Namedentityrecognition(NER)

- Thesauri&lexicaldatabases

- Identificationofinflectedforms

- Identificationofmisspelledwords

TextAnalyticsChallenge

THREE MAJOR OBSTACLES

1) Very large number of word forms

2) Polymorphy of languageOne idea ® multiple forms

TextAnalyticsChallenge

THREE MAJOR OBSTACLES

1) Very large number of word forms

2) Polymorphy of languageOne idea ® multiple forms

3) Polysemy of wordsOne word ® many ideas

TextAnalyticsChallenge

TextAnalyticsChallenge

Keyword in Context List (KWIC)

Senses of word “stress”#1 (psychology) a state of mental or emotional strain or suspense

#2 (physics) force that produces strain on a physical body

#3 Verb - single out as important

Challenge#3– Polysemy ofwords

Page 10: TEXT MINING CHALLENGES AND SOLUTIONS IN BIG · PDF file1/16/17 1 Dr. Normand Péladeau President& CEO ProvalisResearchCorp. peladeau@provalisresearch.com TEXT MINING CHALLENGES AND

1/16/17

10

Keyword in Context List (KWIC)

Disambiguation using phrasesSTRESS*_THE or STRESS*_THAT ® “single out as important”

UNDER_STRESS or THEIR STRESS ® Emotional State

Challenge#3– Polysemy ofwords ItemMatching Rules

IMPROPERMATCHINGRULES• Anymatchingitems• Firstitemencounteredinalphabeticallysortedlist

PROPERMATCHINGPRIORITYRULES• Firstitemencounteredinacarefullyarrangedlistor

• Longerphrasesovershorterphrases• Phrasesoverwords• Wordsoverwordpatterns• Longerwordpatternsovershorterones

RuleofThumb

PROPOSEDBYBENGSTON&XU(1995)

• Everysingleiteminadictionaryshouldproduceatleast80%oftruepositives(TP).

• Ifnot,trytoremovefalsepositives(FP)usingassociatedphrasesuntilFPitislessthan20%.

• IfTPstillbelow80%,removethewordfromthedictionaryandaddassociatedTPphrases.

CAUTION:The80%criteriadonottakeintoaccountcostsassociatedwithfalsenegatives.

RuleofThumb

Disambiguation using rulesTRANSFER* IS NEAR TECHNOLOGY

TRANSFER* IS NOT NEAR BUS

SATISFIED IS AFTER #NEGATION

Keyword in Context List (KWIC)

Challenge#3– Polysemy ofwords

1.8millionstudentcomments• Morethan35millionwords

• 78,159wordforms

• 46,404“unknown”wordso 75%misspellings(≈35,000)o 21%propernames(products&people)o 4%acronyms

Challenge#4– Misspellings

Page 11: TEXT MINING CHALLENGES AND SOLUTIONS IN BIG · PDF file1/16/17 1 Dr. Normand Péladeau President& CEO ProvalisResearchCorp. peladeau@provalisresearch.com TEXT MINING CHALLENGES AND

1/16/17

11

Challenge#4– Misspellings61waystobe“Enthusiastic”

Challenge#4– Misspellings

Fuzzy andphonetic stringcomparison algorithms:

• Damerau-Levenshtein• Koelner Phonetik• SoundEx• Metaphone• Double-Metaphone• NGram• Dice• Jaro-Winkler• Needleman-Wunch• Smith-Waterman-Gotoh• Monge-Elkan

Challenge#4– Misspellings

1) Training Phase

Classification Rules

2) Classification of documents

Classification Rules? ? ? ? ?

AutomaticDocumentClassification

SophisticatedMethodsofClassification• NaïveBayesClassifiers

– ProbabilisticclassifiersthatarebasedonBayes’theorem,whichstatesthattheprobabilityofanevent’soccurrenceisequaltotheintrinsicprobabilitytimestheprobabilitythatitwillhappenagain(naïve=simplisticassumptionthattheobjectsarecompletelyindependentofoneanother

• Rocchio Classification• K-NearestNeighborMethod

– AmethodtoclusterdocumentsbasedontheirdistancetotheKnearest“neighbor”documents

• SupportVectorMachine

AutomaticDocumentClassification• Algorithmicapproachtotextto:

– Recommendations/Predictions(Pandora/Amazon)– Classification(Knowndatatodefinenewdata=spam– Clustering(Newgroupsofsimilardata=GoogleNews)

• LargeDataSets(LargeNumbersofWordsorPhrases)– BagofWordsApproach– High-DimensionalVectorSpaces

• CommonMLalgorithmsfortextcategorization– ArtificialNeuralnetworks– Decisiontrees– SupportVectorMachines(SVM)

• SupervisedMachineLearning– Providingasetof“inputfeatures”(e.g.terms)canbeprovidedtohelp

enableMachineLearning(ML)– Aniterativeprocess,whereoutputsarecomparedwithknownvalues

• UnsupervisedMachineLearning– Classificationofdocumentswherethecategoriesofatestsetarenot

known

MachineLearning