text mining challenges and solutions in big · pdf file1/16/17 1 dr. normand péladeau...

1/16/17

1

Dr.NormandPéladeauPresident &CEO

Provalis Research [email protected]

TEXTMININGCHALLENGESANDSOLUTIONSINBIGDATA

Dr.DerrickL.CogburnHICSSGlobalVirtualTeamsMini-TrackCo-Chair

HICSSTextAnalyticsMini-TrackCo-ChairAssociateProfessor,SchoolofInternationalService

ExecutiveDirector,InstituteonDisabilityandPublicPolicyCOTELCO:TheCollaborationLaboratory

[email protected]

@derrickcogburn

Objectives

Attheendofthisworkshop,participantsshouldbeableto:

1. Understandthemainchallengestextanalystsarefacing.

2. Identifyvarioustextanalysisstrategiesandtechniquestodealwiththosechallenges.

3. Recognizetheirrespectivestrengthsandweaknesses.

4. Identifyvariousexploratorytextminingtechniques.

5. Applydictionaryconstructionandvalidationprinciples

Andifenoughtime6. Understandsomeofthebasicfeaturesofautomaticdocument

classificationtechniques

• PracticalTextMiningandStatisticalAnalysisforNon-StructuredTextDataApplications,GaryMineretal,AcademicPress/Elsevier(2012)(AvailableforKindle)

• RforEveryone:AdvancedAnalyticsandGraphics.JaredP.Lander.Addison-Wesley,2014.(AvailableforKindle)

• AnIntroductiontoDataScience,JeffreyStanton(2013).(FreeiBookorPDF)http://jsresearch.net/wiki/projects/teachdatascience

• Hadoop:TheEnginethatDrivesBigData,LarsNielsen.ExecutiveSummary(AvailableforKindle)NewStreetCommunications(2013)

• TextMiningforQualitativeDataAnalysisintheSocialSciences.Gregor Wiedemann (2016).Springer.

RecommendedTexts

• Bits=BinaryDigit(0,1)• Nibble=4bits• Byte=8bits(256combinations)

– (NB:KBv.Kb)storagev.transmission

Meaning 2nd Digit 1st Digit

No 0 0

Maybe 0 1

Probably 1 0

Definitely 1 1

Source:JeffStanton,IntroductiontoDataScience

Atwobitcodebookforresponsetoaninvitation

Understandingthe“data”inBigData

• Comparisonoffilesizes:– Kilobyte (KB)=1,024bytes(2-3paragraphsofplaintext)– Megabyte (MB)=1,048,576bytesor1,024Kilobytes(873pagesofplaintext)– Gigabyte (GB)1,073,741,824(230)bytes.1,024Megabytes,or1,048,576Kilobytes

(894,784pagesofplaintext)– Terabyte (TB)1,099,511,627,776(240)bytes,1,024Gigabytes,or1,048,576Megabytes

(916,259,689pagesofplaintext)– Petabyte (PB)1,125,899,906,842,624(250)bytes,1,024Terabytes,1,048,576Gigabytes,

or1,073,741,824Megabytes(938,249,922,368pagesofplaintext)– Exabyte (EB)1,152,921,504,606,846,976(260)bytes,1,024Petabytes,1,048,576

Terabytes,1,073,741,824Gigabytes,or1,099,511,627,776Megabytes(960,767,920,505,705pagesofplaintext)

– Zettabyte (ZB)1,180,591,620,717,411,303,424(270)bytes,1,024Exabytes,1,048,576Petabytes,1,073,741,824Terabytes,1,099,511,627,776Gigabytes,or1,125,899,910,000,000Megabytes(983,826,350,597,842,752pagesofplaintext)

– Yottabyte (YB)1,208,925,819,614,629,174,706,176(280)bytes,1,024Zettabytes,1,048,576Exabytes,1,073,741,824Petabytes,1,099,511,627,776Terabytes,1,125,899,910,000,000Gigabytes,or1,152,921,500,000,000,000Megabytes(1,007,438,183,012,190,978,921pagesofplaintext)

Source:http://www.computerhope.com/issues/chspace.htm

Understandingthe“big”inBigData

• “BigData”isarelativeterm– itmeansdifferentthingstodifferentpeople/disciplines:

• Whenwetalkof“Big”data,wemean“big”lessinabsolutetermsandmoreintermsrelativetothecomprehensivenatureofthedata.

• 75-80%oftheworld’savailabledataisunstructuredtext(unstructuredinformationgrowingat15timesstructured)

• “Inthepast50years,theNewYorkTimesproduced3billionwords”and“Twitterusersproduce8billionwords– everysingleday”(Kalev Leetaru,UniversityofIllinois,andKaisler,Armour,Espinosa,andMoney,2014)

• Inadditiontotext(websites,blogs,socialmedia,emailarchives,annualreports,meetingtranscripts,publishedarticles–newspapersandjournals)thereareimage,video,audio,GPS,RFID,andothertypesofBigData

Definingthe“big”inBigData

1/16/17

2

• The“ThreeVs Model”ofBigData

• Volume =theamountofavailabledata• Velocity =speedatwhichdataarrives/decays• Variety =differenttypesofdata

– Plus:

• Veracity =accuracyofthedata• Variability =differinginterpretationsofthedata• Value =relativeimportanceofthedata

Source:DougLaney,BusinessAnalyst,Gartner

Definingthe“big”inBigData• ToolsandTechniquesforscrapingsites– Software

• Sitesucker*• PageSucker• WebGrabber• WebDumper

– HandCoding• Perl• Python• Ruby

• Downloadingemaillistservearchives– Mbox format,Gmail,etc.

• TwitterAPIs– dev.twitter.com

• ProvalisSocialMediaScraper– WebCollector

• Newspapersandpublishedarticles– e-libraryresources,Lexis-Nexis,etc. *

AcquiringTextData

TextAnalyticsApplications• SentimentAnalysis(socialmedia)• VoiceoftheCustomer(emails,chat,callcentertranscripts)• Productimprovement(warrantyclaims)• CompetitiveIntelligence(patents,websites)• Riskmanagement(incidentormaintenancereports)• Frauddetection(insuranceclaims)• Reputationmanagement(news,blogs,socialmedia)• Scientometrics studies(journalarticles,titles&abstracts)• Crimeanalysis(narratives,computerforensics,testimonies)• Surveyanalysis(open-endedquestions)• Financialprediction(earningsreleases,news,pressreleases)• Surveillancesystem(communication,medicalreports)•Manymore...

TextMiningintheWorldDataSciences

FOURAPPROCHESTOTEXTANALYSIS

1. ComputerAssistedQualitativeAnalysis

2. Exploratorytextmining

3. QuantitativeContentAnalysis

4. AutomaticDocumentClassification

TextAnalysisLandscape

• Each text analysis method has its own strengths and weaknesses.

• No single method is appropriate for all text analysis tasks.

• A single text analysis task often benefit from combining several methods.

OurProposal

1/16/17

3

Toolswewilluse

ContentAnalysis&TextMining

QualitativeAnalysis&MixedMethods

SomeothertoolsavailableCommercialtools

IBMTextModeler forText,IBMWatsonSASTextMiner,Clarabridge,Lexalytics,AlchemyAPI,Attensity,Enkata,OdinText,etc.

Open-sourcetoolsTextminingmodulesRprogrammingmodules(R/tm)Gensim,Mallet,Quanteda,RapidMiner,Gate,KNIME

NLPLibrariesStanfordNLP,NaturalLanguageTookit (NLTK)OpenCalaisApacheOpenNLP,etc.

http://www.kdnuggets.com/software/text.html

Iamonly demon-trating theannota-tion feature,so youdon’t need toreadthis!

Still reading

Computer AssistedQualitativeAnalysis Computer AssistedQualitativeAnalysis

Necessity is the mother of inventionLaziness

THREE MAJOR OBSTACLES

1) Very large number of word forms

2) Polymorphy of languageOne idea ® multiple forms

3) Polysemy of wordsOne word ® many ideas

TextAnalyticsChallenge

1/16/17

4





TextAnalyticsChallenge Challenge#1– Quantity

38,996commentsabouthotels• 2,1millionwords(tokens)• 20,116termsorwordforms(types)

1,8millioncourseevaluations• 35millionswords(tokens)• 78,159termsorwordforms(types)

TextAnalyticsChallenge TextAnalyticsChallenge


• Theorderofthewordsinthedocumentdoesnotmatter

• Whilea“bigassumption”textminingexpertshavefoundthattheycanstilldifferentiatebetweensemanticconceptsbyusingallthewordsinthedocuments

• Donotworkinallsituationsandsomeinformationextractiontasksandnaturallanguageprocessingreliesheavilyonthewordsthemselves(e.g.partofspeechtagging)andtheorderofthewords(precedingandfollowing)

• Specializedalgorithmsareusedinthesecases

The “bagofwords”assumption

1/16/17

5

Challenge#1– SolutionsLinguisticPre-processing

• Removalofstopwords

• Stemming

• Lemmatization

Challenge#1– StopWords

RemovalofStopWords• Wordsthatareeitherinsignificant(i.e.,articles,prepositions)ortoocommon

• Examples:“the”,“and”,“or”,“a”,“of”,“to”,“at”,“is”,“it”,“have”,“who”,etc.

• Caution:usewithcare“IT”as“informationtechnology”

“TheWho”,“takethat”,“amust”

Negation:not,no,never,seldom,no,etc

“but”,“however”,“otherwise”

Challenge#1– StemmingStemming- Removalofcommonprefixesandsuffixesto

obtainawordstem

Example:prefix– stem– suffixun– avail– able

Issue:Stemmingerrors

universal,university,universe->univers

designate,design->design

paste,past->past

political,polite->polit

several,severance->sever

Challenge#1– Lemmatization

Lemmatization:Reducinginflectedformsofwordstotheircanonicalform.

Examples: walk,walks,walked,waking->walkam,are,is->be

Twoforms:1.Linguistic(veryslowbutmoreprecise)2.Statistical(fastbutlessaccurate)

Issue- Somelossinsemanticprecision• Differentusesofsingularvs pluralforms

• Differentusesofverbtenses

Challenge#1– SolutionsLinguisticPre-processing

• Removalofstopwords

• Stemming

• Lemmatization

Statisticaltools• Frequencyselection

• Datareductiontechniques(HCA,PCA,FA)

• Exploratorydataanalysis(ex.CA).

• MachineLearning

TheStatisticsofTextDistributionofwords:Zipf distribution

1/16/17

6

TheStatisticsofText

38,988commentsabouthotels2.1Mwords(20,114differentterms)

MOST FREQUENTTERMS

PERCENTAGE OFTERMS

PERCENTAGE OFWORDS

49 0.24% 50%

300 1.5% 76%

500 2.5% 83%

1000 5.0% 90%


TFxIDF – Termfrequencyxinversedocumentfrequency

Heuristictechniqueforselectingwordsthatareimportantinacorpus

Principles:Ifawordappearsfrequentlyinadocument,it’simportantIfawordappearsinmanydocuments,itislessimportant

Basicformula:ft,d xlog(N/nt)

PROS

• Identificationoftopics&structureoftopics

• Maybeusedtoreducedimensionality

• Tendstogroupsynonyms(polymorphy)

CONS• Doesnotdealadequatelywithpolysemy ofwords

• Nosinglebestsolution

(morelater)

HierarchicalClustering

PROS

• Fastidentificationoftopics

• Reducedimensionality

• Dealpartiallywithsynonymy&polysemy

CONS

• Nosinglebestsolution

(morelater)

TopicModeling(LSA,pLSA,LDA,PAM,etc.)

PROS• Veryfast• Verylittleefforts• Inductive

CONS• Comparabilityofresults• Imprecisequantification• Insensitivetolowfrequencyevents• Sensitivetostructuredtextelements• Inductive

Text MiningApproach




1/16/17

7




TextAnalyticsChallenge Contentanalysismethod

Contentanalysismethod Contentanalysismethod

CreationofadictionaryCustom Dictionary forCourseEvaluation CreationofadictionaryCustom Dictionary forSDGsandPWDs

• SDG Goal 4: Education = 2– 4.5 By 2030, eliminate gender disparities

in education and ensure equal access to all levels of education and vocational training for the vulnerable, including persons with disabilities, indigenous peoples and children in vulnerable situations

– 4.a. Build and upgrade education facilities that are child, disability and gender sensitive and provide safe, nonviolent, inclusive and effective learning environments for all

• Goal4.5:– GenderDisparity

• Keywords• Keyphrases

– EqualAccesstoEducation

– EqualAccesstoVocationalTraining

• Goal4.a– Facilities– SafeEnvironment– InclusiveEnvironment– EffectiveLearning

1/16/17

8

CreationofadictionaryCustom Dictionary forSDGsandPWDs CreationofadictionaryCustom Dictionary forSDGsandPWDs

CreationofadictionaryCustom Dictionary forSDGsandPWDs ContentanalysismethodPROS

• Canpotentiallymeasuremoreaccurately• Canbefocused(multi-focus)• Allowsfullautomation• Allowscomparison(overtime– acrosstextcollections)• Allowsmeasurementsoflatentdimensions• Publiclyorcommerciallyavailabledictionaries• Deductiveapproach

PSYCHOMETRICMEASUREMENT

• LinguisticInquiryandWordCount(LIWC)- Pennebaker

• RegressiveImageryDictionary(RID)– Martindale

• CommunicationVaguenessDictionary– Hiller

• BrandPersonalityDictionary- Opoku

SOCIO-POLITICALMEASUREMENT

• DICTION- Hart• Lasswell ValueDictionary- Lasswell• GeneralInquirerHarvardIV- Stone

MeasureLatentDimensions MeasureLatentDimensions

COMMUNICATIONVAGUENESSDICTIONARY

1/16/17

9

ContentanalysismethodPROS

• Canpotentiallymeasuremoreaccurately• Canbefocused(multi-focus)• Allowsfullautomation• Allowscomparison(overtime– acrosstextcollection)• Allowsmeasurementsoflatentdimensions• Publiclyorcommerciallyavailabledictionaries• Deductive

CONS• Timerequiredforconstruction&validation• Improperuseofexistingdictionaries

Toolsfordictionaryconstruction

- Clustering&topicmodeling

- Frequencylistofwords

- Phraseextraction

- Namedentityrecognition(NER)

- Thesauri&lexicaldatabases

- Identificationofinflectedforms

- Identificationofmisspelledwords












Keyword in Context List (KWIC)

Senses of word “stress”#1 (psychology) a state of mental or emotional strain or suspense

#2 (physics) force that produces strain on a physical body

#3 Verb - single out as important

Challenge#3– Polysemy ofwords

1/16/17

10


Disambiguation using phrasesSTRESS*_THE or STRESS*_THAT ® “single out as important”

UNDER_STRESS or THEIR STRESS ® Emotional State

Challenge#3– Polysemy ofwords ItemMatching Rules

IMPROPERMATCHINGRULES• Anymatchingitems• Firstitemencounteredinalphabeticallysortedlist

PROPERMATCHINGPRIORITYRULES• Firstitemencounteredinacarefullyarrangedlistor

• Longerphrasesovershorterphrases• Phrasesoverwords• Wordsoverwordpatterns• Longerwordpatternsovershorterones

RuleofThumb

PROPOSEDBYBENGSTON&XU(1995)

• Everysingleiteminadictionaryshouldproduceatleast80%oftruepositives(TP).

• Ifnot,trytoremovefalsepositives(FP)usingassociatedphrasesuntilFPitislessthan20%.

• IfTPstillbelow80%,removethewordfromthedictionaryandaddassociatedTPphrases.

CAUTION:The80%criteriadonottakeintoaccountcostsassociatedwithfalsenegatives.

RuleofThumb

Disambiguation using rulesTRANSFER* IS NEAR TECHNOLOGY

TRANSFER* IS NOT NEAR BUS

SATISFIED IS AFTER #NEGATION


Challenge#3– Polysemy ofwords

1.8millionstudentcomments• Morethan35millionwords

• 78,159wordforms

• 46,404“unknown”wordso 75%misspellings(≈35,000)o 21%propernames(products&people)o 4%acronyms

Challenge#4– Misspellings

1/16/17

11

Challenge#4– Misspellings61waystobe“Enthusiastic”


Fuzzy andphonetic stringcomparison algorithms:

• Damerau-Levenshtein• Koelner Phonetik• SoundEx• Metaphone• Double-Metaphone• NGram• Dice• Jaro-Winkler• Needleman-Wunch• Smith-Waterman-Gotoh• Monge-Elkan


1) Training Phase

Classification Rules

2) Classification of documents

Classification Rules? ? ? ? ?

AutomaticDocumentClassification

SophisticatedMethodsofClassification• NaïveBayesClassifiers

– ProbabilisticclassifiersthatarebasedonBayes’theorem,whichstatesthattheprobabilityofanevent’soccurrenceisequaltotheintrinsicprobabilitytimestheprobabilitythatitwillhappenagain(naïve=simplisticassumptionthattheobjectsarecompletelyindependentofoneanother

• Rocchio Classification• K-NearestNeighborMethod

– AmethodtoclusterdocumentsbasedontheirdistancetotheKnearest“neighbor”documents

• SupportVectorMachine

AutomaticDocumentClassification• Algorithmicapproachtotextto:

– Recommendations/Predictions(Pandora/Amazon)– Classification(Knowndatatodefinenewdata=spam– Clustering(Newgroupsofsimilardata=GoogleNews)

• LargeDataSets(LargeNumbersofWordsorPhrases)– BagofWordsApproach– High-DimensionalVectorSpaces

• CommonMLalgorithmsfortextcategorization– ArtificialNeuralnetworks– Decisiontrees– SupportVectorMachines(SVM)

• SupervisedMachineLearning– Providingasetof“inputfeatures”(e.g.terms)canbeprovidedtohelp

enableMachineLearning(ML)– Aniterativeprocess,whereoutputsarecomparedwithknownvalues

• UnsupervisedMachineLearning– Classificationofdocumentswherethecategoriesofatestsetarenot

known

MachineLearning

text mining challenges and solutions in big · pdf file1/16/17 1 dr. normand péladeau...

Documents