1/16/17
TEXT MINING CHALLENGES AND SOLUTIONS IN BIG DATA
Dr. Normand Péladeau
President & CEO
Provalis Research [email protected]
Dr. Derrick L. Cogburn
HICSS Global Virtual Teams Mini-Track Co-Chair
HICSS Text Analytics Mini-Track Co-Chair
Associate Professor, School of International Service
Executive Director, Institute on Disability and Public Policy
COTELCO: The Collaboration Laboratory
@derrickcogburn
Objectives
At the end of this workshop, participants should be able to:
1. Understand the main challenges text analysts are facing.
2. Identify various text analysis strategies and techniques to deal with those challenges.
3. Recognize their respective strengths and weaknesses.
4. Identify various exploratory text mining techniques.
5. Apply dictionary construction and validation principles.
And, if there is enough time:
6. Understand some of the basic features of automatic document classification techniques.
Recommended Texts
• Practical Text Mining and Statistical Analysis for Non-Structured Text Data Applications, Gary Miner et al., Academic Press/Elsevier (2012) (Available for Kindle)
• R for Everyone: Advanced Analytics and Graphics, Jared P. Lander, Addison-Wesley (2014) (Available for Kindle)
• An Introduction to Data Science, Jeffrey Stanton (2013) (Free iBook or PDF) http://jsresearch.net/wiki/projects/teachdatascience
• Hadoop: The Engine that Drives Big Data, Lars Nielsen, Executive Summary, New Street Communications (2013) (Available for Kindle)
• Text Mining for Qualitative Data Analysis in the Social Sciences, Gregor Wiedemann, Springer (2016)
Understanding the “data” in Big Data
• Bit = binary digit (0, 1)
• Nibble = 4 bits
• Byte = 8 bits (256 combinations)
  – (NB: KB v. Kb) storage v. transmission

A two-bit codebook for response to an invitation
Meaning      2nd Digit   1st Digit
No           0           0
Maybe        0           1
Probably     1           0
Definitely   1           1
Source: Jeff Stanton, Introduction to Data Science
Understanding the “big” in Big Data
• Comparison of file sizes:
  – Kilobyte (KB) = 1,024 bytes (2-3 paragraphs of plain text)
  – Megabyte (MB) = 1,048,576 bytes, or 1,024 Kilobytes (873 pages of plain text)
  – Gigabyte (GB) = 1,073,741,824 (2^30) bytes, 1,024 Megabytes, or 1,048,576 Kilobytes (894,784 pages of plain text)
  – Terabyte (TB) = 1,099,511,627,776 (2^40) bytes, 1,024 Gigabytes, or 1,048,576 Megabytes (916,259,689 pages of plain text)
  – Petabyte (PB) = 1,125,899,906,842,624 (2^50) bytes, 1,024 Terabytes, 1,048,576 Gigabytes, or 1,073,741,824 Megabytes (938,249,922,368 pages of plain text)
  – Exabyte (EB) = 1,152,921,504,606,846,976 (2^60) bytes, 1,024 Petabytes, 1,048,576 Terabytes, 1,073,741,824 Gigabytes, or 1,099,511,627,776 Megabytes (960,767,920,505,705 pages of plain text)
  – Zettabyte (ZB) = 1,180,591,620,717,411,303,424 (2^70) bytes, 1,024 Exabytes, 1,048,576 Petabytes, 1,073,741,824 Terabytes, 1,099,511,627,776 Gigabytes, or 1,125,899,906,842,624 Megabytes (983,826,350,597,842,752 pages of plain text)
  – Yottabyte (YB) = 1,208,925,819,614,629,174,706,176 (2^80) bytes, 1,024 Zettabytes, 1,048,576 Exabytes, 1,073,741,824 Petabytes, 1,099,511,627,776 Terabytes, 1,125,899,906,842,624 Gigabytes, or 1,152,921,504,606,846,976 Megabytes (1,007,438,183,012,190,978,921 pages of plain text)
Source: http://www.computerhope.com/issues/chspace.htm
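Every size in the comparison above is a power of 1,024 (2^10) bytes. A minimal Python sketch that reproduces the byte counts; the `UNITS` list and the `bytes_in` helper are illustrative names, not part of the slides:

```python
# Each binary unit is 1,024 (2**10) times larger than the previous one.
UNITS = ["Kilobyte", "Megabyte", "Gigabyte", "Terabyte",
         "Petabyte", "Exabyte", "Zettabyte", "Yottabyte"]

def bytes_in(unit: str) -> int:
    """Return the number of bytes in one unit, e.g. 2**30 for a Gigabyte."""
    exponent = 10 * (UNITS.index(unit) + 1)
    return 2 ** exponent

for unit in UNITS:
    print(f"{unit:<10} = {bytes_in(unit):,} bytes")
```

Python integers have arbitrary precision, so the larger units are computed exactly rather than rounded.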
Defining the “big” in Big Data
• “Big Data” is a relative term – it means different things to different people/disciplines.
• When we talk of “Big” data, we mean “big” less in absolute terms and more in terms relative to the comprehensive nature of the data.
• 75-80% of the world’s available data is unstructured text (unstructured information is growing at 15 times the rate of structured information)
• “In the past 50 years, the New York Times produced 3 billion words” and “Twitter users produce 8 billion words – every single day” (Kalev Leetaru, University of Illinois, and Kaisler, Armour, Espinosa, and Money, 2014)
• In addition to text (websites, blogs, social media, email archives, annual reports, meeting transcripts, published articles – newspapers and journals) there are image, video, audio, GPS, RFID, and other types of Big Data
Defining the “big” in Big Data
• The “Three Vs Model” of Big Data
  • Volume = the amount of available data
  • Velocity = speed at which data arrives/decays
  • Variety = different types of data
  – Plus:
  • Veracity = accuracy of the data
  • Variability = differing interpretations of the data
  • Value = relative importance of the data
Source: Doug Laney, Business Analyst, Gartner

Acquiring Text Data
• Tools and techniques for scraping sites
  – Software
    • Sitesucker*
    • PageSucker
    • WebGrabber
    • WebDumper
  – Hand coding
    • Perl
    • Python
    • Ruby
• Downloading email listserv archives
  – Mbox format, Gmail, etc.
• Twitter APIs
  – dev.twitter.com
• Provalis Social Media Scraper
  – WebCollector
• Newspapers and published articles
  – e-library resources, Lexis-Nexis, etc. *
Text Analytics Applications
• Sentiment analysis (social media)
• Voice of the customer (emails, chat, call center transcripts)
• Product improvement (warranty claims)
• Competitive intelligence (patents, websites)
• Risk management (incident or maintenance reports)
• Fraud detection (insurance claims)
• Reputation management (news, blogs, social media)
• Scientometrics studies (journal articles, titles & abstracts)
• Crime analysis (narratives, computer forensics, testimonies)
• Survey analysis (open-ended questions)
• Financial prediction (earnings releases, news, press releases)
• Surveillance system (communication, medical reports)
• Many more...
Text Mining in the World Data Sciences
Text Analysis Landscape
FOUR APPROACHES TO TEXT ANALYSIS
1. Computer Assisted Qualitative Analysis
2. Exploratory text mining
3. Quantitative Content Analysis
4. Automatic Document Classification
Our Proposal
• Each text analysis method has its own strengths and weaknesses.
• No single method is appropriate for all text analysis tasks.
• A single text analysis task often benefits from combining several methods.
Tools we will use
Content Analysis & Text Mining
Qualitative Analysis & Mixed Methods

Some other tools available
Commercial tools
IBM Text Modeler for Text, IBM Watson, SAS Text Miner, Clarabridge, Lexalytics, AlchemyAPI, Attensity, Enkata, OdinText, etc.
Open-source tools
Text mining modules, R programming modules (R/tm), Gensim, Mallet, Quanteda, RapidMiner, Gate, KNIME
NLP libraries
Stanford NLP, Natural Language Toolkit (NLTK), OpenCalais, Apache OpenNLP, etc.
http://www.kdnuggets.com/software/text.html
Computer Assisted Qualitative Analysis
[Screenshot: demonstration of the annotation feature]
Necessity is the mother of invention
Laziness
Text Analytics Challenge
THREE MAJOR OBSTACLES
1) Very large number of word forms
2) Polymorphy of language: one idea → multiple forms
3) Polysemy of words: one word → many ideas
Challenge #1 – Quantity
38,996 comments about hotels
• 2.1 million words (tokens)
• 20,116 terms or word forms (types)
1.8 million course evaluations
• 35 million words (tokens)
• 78,159 terms or word forms (types)
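The token/type distinction used in these counts can be illustrated with a short Python sketch. The tokenizer (a lowercase regular expression) and the sample comment are assumptions made for illustration; real tools apply more elaborate tokenization rules.

```python
import re

def tokens_and_types(text: str) -> tuple[int, int]:
    """Return (number of tokens, number of distinct word forms) in a text."""
    # Assumed tokenizer: lowercase runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    return len(words), len(set(words))

# Hypothetical hotel comment, not taken from the workshop data.
comment = "The room was clean and the staff was friendly."
n_tokens, n_types = tokens_and_types(comment)
print(n_tokens, "tokens,", n_types, "types")
```

Here "the" and "was" each occur twice, so the comment has more tokens than types; on large collections the gap becomes very wide, as the figures above show.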
The “bag of words” assumption
• The order of the words in the document does not matter.
• While this is a “big assumption”, text mining experts have found that they can still differentiate between semantic concepts by using all the words in the documents.
• It does not work in all situations: some information extraction and natural language processing tasks rely heavily on the words themselves (e.g., part-of-speech tagging) and on the order of the words (preceding and following).
• Specialized algorithms are used in these cases.
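A minimal Python sketch of the bag-of-words representation, assuming simple whitespace tokenization (the function name and the sample sentences are illustrative). It shows both what the assumption keeps (word counts) and what it discards (word order):

```python
from collections import Counter

def bag_of_words(text: str) -> Counter:
    """Represent a text as word counts, discarding word order."""
    return Counter(text.lower().split())

a = bag_of_words("the dog bit the man")
b = bag_of_words("the man bit the dog")

# Both sentences yield the same bag even though their meanings differ.
print(a == b)
print(a["the"])
```

The two sentences are indistinguishable under this representation, which is why order-sensitive tasks need other techniques.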
Challenge #1 – Solutions
Linguistic Pre-processing
• Removal of stop words
• Stemming
• Lemmatization
Challenge #1 – Stop Words
Removal of Stop Words
• Words that are either insignificant (i.e., articles, prepositions) or too common
• Examples: “the”, “and”, “or”, “a”, “of”, “to”, “at”, “is”, “it”, “have”, “who”, etc.
• Caution: use with care
  – “IT” as “information technology”
  – “The Who”, “take that”, “a must”
  – Negation: not, no, never, seldom, etc.
  – “but”, “however”, “otherwise”
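The cautions above can be sketched as a stop-word filter with explicit exceptions. This is a minimal sketch under stated assumptions: the three word lists are taken from the examples on the slide, while the function name, the whitespace tokenization, and the case-sensitive handling of "IT" are illustrative choices.

```python
STOP_WORDS = {"the", "and", "or", "a", "of", "to", "at", "is", "it", "have", "who"}
# Negations and contrast markers change meaning, so they are never removed.
PROTECTED = {"not", "no", "never", "seldom", "but", "however", "otherwise"}
# Tokens kept when they appear in exactly this form (assumed heuristic).
CASE_SENSITIVE_KEEP = {"IT"}

def remove_stop_words(tokens: list[str]) -> list[str]:
    """Drop stop words while keeping protected and case-sensitive items."""
    kept = []
    for token in tokens:
        lowered = token.lower()
        if token in CASE_SENSITIVE_KEEP or lowered in PROTECTED:
            kept.append(token)
        elif lowered not in STOP_WORDS:
            kept.append(token)
    return kept

result = remove_stop_words("The IT staff is not helpful".split())
print(result)
```

Multi-word items such as “The Who” or “a must” would need phrase detection before stop-word removal, which this sketch does not attempt.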
Challenge #1 – Stemming
Stemming – Removal of common prefixes and suffixes to obtain a word stem
Example: prefix – stem – suffix
         un – avail – able
Issue: Stemming errors
universal, university, universe -> univers
designate, design -> design
paste, past -> past
political, polite -> polit
several, severance -> sever
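The stemming errors listed above can be reproduced with a toy suffix stripper. This is a deliberately simplistic sketch, not the Porter stemmer or any production algorithm: the suffix list, the minimum stem length, and the function name are all assumptions chosen to illustrate how unrelated words collapse onto one stem.

```python
# Toy suffix list, checked in this order (longest first).
SUFFIXES = ["ance", "ical", "ity", "ate", "al", "e"]

def toy_stem(word: str, min_stem: int = 3) -> str:
    """Strip the first matching suffix, keeping at least min_stem letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

for group in (["universal", "university", "universe"],
              ["designate", "design"],
              ["paste", "past"],
              ["political", "polite"],
              ["several", "severance"]):
    print(group, "->", {toy_stem(w) for w in group})
```

Each group collapses to a single stem even though its members have unrelated meanings, which is the loss of precision the slide warns about.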
Challenge #1 – Lemmatization
Lemmatization: Reducing inflected forms of words to their canonical form.
Examples: walk, walks, walked, walking -> walk
          am, are, is -> be
Two forms:
1. Linguistic (very slow but more precise)
2. Statistical (fast but less accurate)
Issue – Some loss in semantic precision
• Different uses of singular vs. plural forms
• Different uses of verb tenses
Challenge #1 – Solutions
Linguistic Pre-processing
• Removal of stop words
• Stemming
• Lemmatization
Statistical tools
• Frequency selection
• Data reduction techniques (HCA, PCA, FA)
• Exploratory data analysis (e.g., CA)
• Machine Learning
The Statistics of Text
Distribution of words: Zipf distribution
The Statistics of Text
38,988 comments about hotels
2.1M words (20,114 different terms)

MOST FREQUENT TERMS   PERCENTAGE OF TERMS   PERCENTAGE OF WORDS
49                    0.24%                 50%
300                   1.5%                  76%
500                   2.5%                  83%
1000                  5.0%                  90%
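The table reports how much of the running text is covered by the most frequent terms. A minimal Python sketch of that computation, assuming a pre-tokenized corpus; the `coverage` function and the tiny sample are illustrative, not the workshop data:

```python
from collections import Counter

def coverage(tokens: list[str], k: int) -> tuple[float, float]:
    """Return (share of types, share of tokens) covered by the k most frequent terms."""
    counts = Counter(tokens)
    covered = sum(count for _, count in counts.most_common(k))
    return k / len(counts), covered / len(tokens)

# Hypothetical mini-corpus: 8 tokens, 4 types.
sample = "the the the the room room staff clean".split()
share_of_types, share_of_tokens = coverage(sample, 1)
print(f"{share_of_types:.0%} of the terms cover {share_of_tokens:.0%} of the words")
```

On a real corpus with a Zipf-like distribution, a very small share of the terms covers most of the words, as in the hotel comments above.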
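Text Analytics Challenge
TF×IDF – Term frequency × inverse document frequency
Heuristic technique for selecting words that are important in a corpus
Principles:
• If a word appears frequently in a document, it’s important
• If a word appears in many documents, it is less important
Basic formula: $f_{t,d} \times \log(N / n_t)$
(where $f_{t,d}$ is the frequency of term $t$ in document $d$, $N$ is the number of documents, and $n_t$ is the number of documents containing $t$)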
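The basic formula can be computed directly. A minimal sketch, assuming pre-tokenized documents and the natural logarithm (the slide does not specify a base); the function name and the sample documents are illustrative:

```python
import math
from collections import Counter

def tf_idf(docs: list[list[str]]) -> list[dict[str, float]]:
    """Weight each term t in each document d by f(t, d) * log(N / n_t)."""
    n_docs = len(docs)
    # n_t: number of documents in which each term occurs at least once.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return [
        {term: freq * math.log(n_docs / doc_freq[term])
         for term, freq in Counter(doc).items()}
        for doc in docs
    ]

# Hypothetical two-document corpus.
docs = [["clean", "room", "clean"], ["noisy", "room"]]
weights = tf_idf(docs)
print(weights[0])
```

"room" occurs in every document, so its weight is zero; "clean" is frequent in one document and absent from the other, so it gets the highest weight.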
Hierarchical Clustering
PROS
• Identification of topics & structure of topics
• May be used to reduce dimensionality
• Tends to group synonyms (polymorphy)
CONS
• Does not deal adequately with polysemy of words
• No single best solution
(more later)
Topic Modeling (LSA, pLSA, LDA, PAM, etc.)
PROS
• Fast identification of topics
• Reduces dimensionality
• Deals partially with synonymy & polysemy
CONS
• No single best solution
(more later)
Text Mining Approach
PROS
• Very fast
• Very little effort
• Inductive
CONS
• Comparability of results
• Imprecise quantification
• Insensitive to low-frequency events
• Sensitive to structured text elements
• Inductive
Text Analytics Challenge
THREE MAJOR OBSTACLES
1) Very large number of word forms
2) Polymorphy of language: one idea → multiple forms

Content analysis method
Creation of a dictionary: Custom Dictionary for Course Evaluation
Creation of a dictionary: Custom Dictionary for SDGs and PWDs
• SDG Goal 4: Education = 2
  – 4.5 By 2030, eliminate gender disparities in education and ensure equal access to all levels of education and vocational training for the vulnerable, including persons with disabilities, indigenous peoples and children in vulnerable situations
  – 4.a Build and upgrade education facilities that are child, disability and gender sensitive and provide safe, nonviolent, inclusive and effective learning environments for all
• Goal 4.5:
  – Gender Disparity
    • Keywords
    • Key phrases
  – Equal Access to Education
  – Equal Access to Vocational Training
• Goal 4.a:
  – Facilities
  – Safe Environment
  – Inclusive Environment
  – Effective Learning
Content analysis method
PROS
• Can potentially measure more accurately
• Can be focused (multi-focus)
• Allows full automation
• Allows comparison (over time – across text collections)
• Allows measurements of latent dimensions
• Publicly or commercially available dictionaries
• Deductive approach
Measure Latent Dimensions
PSYCHOMETRIC MEASUREMENT
• Linguistic Inquiry and Word Count (LIWC) – Pennebaker
• Regressive Imagery Dictionary (RID) – Martindale
• Communication Vagueness Dictionary – Hiller
• Brand Personality Dictionary – Opoku
SOCIO-POLITICAL MEASUREMENT
• DICTION – Hart
• Lasswell Value Dictionary – Lasswell
• General Inquirer Harvard IV – Stone

COMMUNICATION VAGUENESS DICTIONARY
Content analysis method
CONS
• Time required for construction & validation
• Improper use of existing dictionaries
Tools for dictionary construction
- Clustering & topic modeling
- Frequency list of words
- Phrase extraction
- Named entity recognition (NER)
- Thesauri & lexical databases
- Identification of inflected forms
- Identification of misspelled words
Text Analytics Challenge
THREE MAJOR OBSTACLES
1) Very large number of word forms
2) Polymorphy of language: one idea → multiple forms
3) Polysemy of words: one word → many ideas
Challenge #3 – Polysemy of words
Keyword in Context List (KWIC)
Senses of the word “stress”:
#1 (psychology) a state of mental or emotional strain or suspense
#2 (physics) force that produces strain on a physical body
#3 (verb) single out as important
Challenge #3 – Polysemy of words
Keyword in Context List (KWIC)
Disambiguation using phrases
STRESS*_THE or STRESS*_THAT → “single out as important”
UNDER_STRESS or THEIR_STRESS → emotional state

Item Matching Rules
IMPROPER MATCHING RULES
• Any matching items
• First item encountered in alphabetically sorted list
PROPER MATCHING PRIORITY RULES
• First item encountered in a carefully arranged list, or
• Longer phrases over shorter phrases
• Phrases over words
• Words over word patterns
• Longer word patterns over shorter ones
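The proper matching priority rules can be sketched as a sort order applied to dictionary entries before matching. This is one possible reading of the rules, not the behavior of any specific software: the entry syntax ("_" joins the words of a phrase, "*" is a wildcard) follows the phrase examples above, while the category names, the `priority` key, and the whitespace tokenization are assumptions.

```python
import fnmatch

# Hypothetical dictionary entries: (item, category).
DICTIONARY = [
    ("STRESS*", "STRESS_UNRESOLVED"),
    ("UNDER_STRESS", "EMOTIONAL_STATE"),
    ("STRESS*_THAT", "EMPHASIS"),
]

def priority(entry):
    """Phrases before words, exact items before patterns, longer before shorter."""
    item = entry[0]
    n_words = item.count("_") + 1
    return (-n_words, "*" in item, -len(item))

def categorize(text: str) -> list[str]:
    """Scan the text left to right, applying the highest-priority match first."""
    tokens = text.upper().split()
    ordered = sorted(DICTIONARY, key=priority)
    found, i = [], 0
    while i < len(tokens):
        for item, category in ordered:
            parts = item.split("_")
            window = tokens[i:i + len(parts)]
            if len(window) == len(parts) and all(
                fnmatch.fnmatchcase(tok, part) for tok, part in zip(window, parts)
            ):
                found.append(category)
                i += len(parts)  # consume the matched words
                break
        else:
            i += 1  # no entry matched at this position
    return found

print(categorize("students are under stress"))
print(categorize("he stressed that point"))
print(categorize("stress fractures"))
```

Because phrases are tried before the bare pattern STRESS*, the ambiguous word is only left unresolved when no disambiguating phrase applies.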
Rule of Thumb
PROPOSED BY BENGSTON & XU (1995)
• Every single item in a dictionary should produce at least 80% true positives (TP).
• If not, try to remove false positives (FP) using associated phrases until the FP rate is less than 20%.
• If TP is still below 80%, remove the word from the dictionary and add associated TP phrases.
CAUTION: The 80% criterion does not take into account the costs associated with false negatives.
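The 80% criterion amounts to a simple check on hand-coded KWIC hits for each dictionary item. A minimal sketch, assuming the analyst has already counted true and false positives for an item; the function name, the returned labels, and the sample counts are hypothetical:

```python
def review_item(true_positives: int, false_positives: int,
                threshold: float = 0.80) -> tuple[float, str]:
    """Return the TP rate of a dictionary item and a suggested action."""
    total = true_positives + false_positives
    if total == 0:
        raise ValueError("the item produced no hits to evaluate")
    tp_rate = true_positives / total
    action = "keep" if tp_rate >= threshold else "refine with phrases or remove"
    return tp_rate, action

# Hypothetical counts from a manual review of KWIC lines.
print(review_item(45, 5))
print(review_item(6, 4))
```

As the caution above notes, this check says nothing about false negatives, so an item removed this way may still cost recall.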
Challenge #3 – Polysemy of words
Keyword in Context List (KWIC)
Disambiguation using rules
TRANSFER* IS NEAR TECHNOLOGY
TRANSFER* IS NOT NEAR BUS
SATISFIED IS AFTER #NEGATION
Challenge #4 – Misspellings
1.8 million student comments
• More than 35 million words
• 78,159 word forms
• 46,404 “unknown” words
  o 75% misspellings (≈35,000)
  o 21% proper names (products & people)
  o 4% acronyms
Challenge #4 – Misspellings
61 ways to be “Enthusiastic”

Fuzzy and phonetic string comparison algorithms:
• Damerau-Levenshtein
• Kölner Phonetik
• Soundex
• Metaphone
• Double Metaphone
• N-gram
• Dice
• Jaro-Winkler
• Needleman-Wunsch
• Smith-Waterman-Gotoh
• Monge-Elkan
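As an illustration of the first algorithm in the list, here is a minimal sketch of the restricted (optimal string alignment) variant of the Damerau-Levenshtein distance, which counts insertions, deletions, substitutions, and transpositions of adjacent letters. The function name and the misspellings are illustrative; they are not taken from the 61 variants found in the data.

```python
def osa_distance(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
            if (i > 1 and j > 1 and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]

# Hypothetical misspellings of "enthusiastic".
for variant in ("enthusaistic", "enthusiatic", "enthousiastic"):
    print(variant, osa_distance("enthusiastic", variant))
```

Candidate corrections can then be ranked by distance; a small distance to a known word suggests a misspelling rather than a new term.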
Automatic Document Classification
1) Training phase → classification rules
2) Classification of documents: classification rules applied to unclassified documents (?)
Sophisticated Methods of Classification
• Naïve Bayes Classifiers
  – Probabilistic classifiers based on Bayes’ theorem, which states that the probability of a class given the observed evidence is proportional to the prior probability of that class times the probability of the evidence given the class (naïve = simplistic assumption that the features, e.g., words, are completely independent of one another)
• Rocchio Classification
• K-Nearest Neighbor Method
  – A method that classifies a document based on its distance to the K nearest “neighbor” documents
• Support Vector Machine
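A minimal sketch of a multinomial Naïve Bayes text classifier shows how the training phase and the classification phase fit together. The add-one (Laplace) smoothing, the function names, and the tiny spam/ham training set are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

def train(labelled_docs):
    """Training phase: count classes and per-class word frequencies."""
    class_counts, word_counts, vocab = Counter(), defaultdict(Counter), set()
    for tokens, label in labelled_docs:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def classify(tokens, model):
    """Return the class maximizing log P(class) + sum of log P(word | class)."""
    class_counts, word_counts, vocab = model
    n_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in class_counts.items():
        total = sum(word_counts[label].values())
        score = math.log(n / n_docs)  # prior probability of the class
        for token in tokens:
            if token in vocab:  # words never seen in training are ignored
                # Add-one smoothing avoids zero probabilities.
                score += math.log((word_counts[label][token] + 1)
                                  / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training set.
model = train([
    ("win money now".split(), "spam"),
    ("cheap money offer".split(), "spam"),
    ("meeting agenda attached".split(), "ham"),
    ("project meeting tomorrow".split(), "ham"),
])
print(classify("money offer now".split(), model))
print(classify("meeting tomorrow".split(), model))
```

The independence assumption appears in the loop that simply adds one log-probability per word, ignoring any relationship between words.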
Machine Learning
• Algorithmic approach to text for:
  – Recommendations/Predictions (Pandora/Amazon)
  – Classification (known data to define new data = spam)
  – Clustering (new groups of similar data = Google News)
• Large Data Sets (large numbers of words or phrases)
  – Bag of Words Approach
  – High-Dimensional Vector Spaces
• Common ML algorithms for text categorization
  – Artificial neural networks
  – Decision trees
  – Support Vector Machines (SVM)
• Supervised Machine Learning
  – A set of “input features” (e.g., terms) is provided to enable Machine Learning (ML)
  – An iterative process, where outputs are compared with known values
• Unsupervised Machine Learning
  – Classification of documents where the categories of a test set are not known