-
7/29/2019 Computational Journalism at Columbia, Fall 2013: Lecture 2, Text Analysis
Frontiers of Computational Journalism
Columbia Journalism School
Week 2: Text Analysis
September 11, 2013
-
Lecture 2: Text Analysis

What is text analysis?
Vector Space Model and Cosine Distance
TF-IDF
Topic Models
-
Basic idea: quantitative information can tell stories

"When Hu Jintao came to power in 2002, China was already experiencing a worsening social crisis. In 2004, President Hu offered a rhetorical response to growing internal instability, trumpeting what he called a harmonious society. For some time, this new watchword burgeoned, becoming visible everywhere in the Party's propaganda."

- Qian Gang, Watchwords: Reading China through its Party Vocabulary
-
"But by 2007 it was already on the decline, as stability preservation made its rapid ascent. ... Together, these contrasting pictures of the harmonious society and stability preservation form a portrait of the real predicament facing President Hu Jintao. A harmonious society may be a pleasing idea, but it's the iron will behind stability preservation that packs the real punch."

- Qian Gang, Watchwords: Reading China through its Party Vocabulary
-
Google ngrams viewer: 12% of all books ever published
-
Data can give a wider view

"Let me talk about Downton Abbey for a minute. The show's popularity has led many nitpickers to draw up lists of mistakes. ... But all of these have relied, so far as I can tell, on finding a phrase or two that sounds a bit off, and checking the online sources for earliest use.

I lack such social graces. So I thought: why not just check every single line in the show for historical accuracy? ... So I found some copies of the Downton Abbey scripts online, and fed every single two-word phrase through the Google Ngram database to see how characteristic of the English language, c. 1917, Downton Abbey really is."

- Ben Schmidt, Making Downton more traditional
-
Bigrams that do not appear in English books between 1912 and 1921.
-
Bigrams that are at least 100 times more common today than they were in 1912-1921.
-
Lecture 2: Text Analysis

What is text analysis?
Vector Space Model and Cosine Distance
TF-IDF
Topic Models
-
Documents, not words

We can use clustering techniques if we can convert documents into vectors.

As before, we want to find numerical features that describe the document.

How do we capture the meaning of a document in numbers?
-
What is this document "about"?

Most commonly occurring words are a pretty good indicator.

30  the
23  to
19  and
19  a
18  animal
17  cruelty
15  of
15  crimes
14  in
14  for
11  that
 8  crime
 7  we
-
Turns out features = words works fine

Encode each document as the list of words it contains.

Dimensions = vocabulary of document set.

Value on each dimension = # of times word appears in document
-
Example

D1 = "I like databases"
D2 = "I hate hate databases"

      I  like  hate  databases
D1    1   1     0      1
D2    1   0     2      1

Each row = document vector
All rows = term-document matrix
Individual entry = tf(t,d) = term frequency
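The encoding above can be sketched in a few lines of Python (a minimal illustration, not code from the course):

```python
from collections import Counter

docs = ["I like databases", "I hate hate databases"]

# Tokenize (lowercase, split on spaces) and count terms per document
counts = [Counter(doc.lower().split()) for doc in docs]

# Vocabulary of the document set = the dimensions of the vector space
vocab = sorted(set(term for c in counts for term in c))

# Each row is a document vector; all rows together form the term-document matrix
matrix = [[c[term] for term in vocab] for c in counts]

print(vocab)   # ['databases', 'hate', 'i', 'like']
print(matrix)  # [[1, 0, 1, 1], [1, 2, 1, 0]]
```

Note that "hate" gets the value 2 in D2's row: the entry is a count, not just presence.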
-
Aka "bag of words" model

Throws out word order.

e.g. "soldiers shot civilians" and "civilians shot soldiers" encoded identically.
-
Tokenization

The documents come to us as long strings, not individual words. Tokenization is the process of converting the string into individual words, or "tokens."

For this course, we will assume a very simple strategy:
- convert all letters to lowercase
- remove all punctuation characters
- separate words based on spaces

Note that this won't work at all for Chinese. It will fail in some ways even for English. How?
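The simple strategy above can be written directly in Python (one possible sketch):

```python
import string

def tokenize(text):
    """Very simple tokenizer: lowercase, strip punctuation, split on spaces."""
    text = text.lower()
    # remove all punctuation characters
    text = text.translate(str.maketrans("", "", string.punctuation))
    return text.split()

print(tokenize("My car is old. I want a new car, a shiny car"))
# ['my', 'car', 'is', 'old', 'i', 'want', 'a', 'new', 'car', 'a', 'shiny', 'car']
```

One English failure mode is visible immediately: stripping punctuation turns "don't" into "dont" and merges hyphenated words.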
-
Distance function

Useful for:
- clustering documents
- finding docs similar to an example
- matching a search query

Basic idea: look for overlapping terms.
-
Cosine similarity

Given document vectors a, b define:

similarity(a, b) = a · b

If each word occurs exactly once in each document, equivalent to counting overlapping words.

Note: not a distance function, as similarity increases when documents are similar. (What part of the definition of a distance function is violated here?)
-
Problem: long documents always win

Let a = "This car runs fast."
Let b = "My car is old. I want a new car, a shiny car"
Let query = "fast car"

    this  car  runs  fast  my  is  old  I  want  a  new  shiny
a    1     1    1     1    0   0    0   0    0   0   0     0
b    0     3    0     0    1   1    1   1    1   1   1     1
q    0     1    0     1    0   0    0   0    0   0   0     0
-
Problem: long documents always win

similarity(a, q) = 1·1 [car] + 1·1 [fast] = 2
similarity(b, q) = 3·1 [car] + 0·1 [fast] = 3

Longer document more similar, by virtue of repeating words.
-
Normalize document vectors

similarity(a, b) = (a · b) / (|a| |b|) = cos(θ)

returns result in [0, 1]
-
Normalized query example

    this  car  runs  fast  my  is  old  I  want  a  new  shiny
a    1     1    1     1    0   0    0   0    0   0   0     0
b    0     3    0     0    1   1    1   1    1   1   1     1
q    0     1    0     1    0   0    0   0    0   0   0     0

similarity(a, q) = 2 / (√4 · √2) = 1/√2 ≈ 0.707
similarity(b, q) = 3 / (√17 · √2) ≈ 0.514
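The normalized computation can be checked with a short Python sketch, using the vectors from the table above:

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b, divided by the product of their vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Rows a, b, q from the term-document matrix above
a = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
b = [0, 3, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]
q = [0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]

print(round(cosine_similarity(a, q), 3))  # 0.707
print(round(cosine_similarity(b, q), 3))  # 0.514
```

After normalization the short document a wins, as it should: repeating "car" no longer helps b.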
-
Cosine similarity

cos(θ) = similarity(a, b) = (a · b) / (|a| |b|)
-
Cosine distance (finally)

dist(a, b) = 1 − (a · b) / (|a| |b|)
-
Lecture 2: Text Analysis

What is text analysis?
Vector Space Model and Cosine Distance
TF-IDF
Topic Models
-
Problem: common words

We want to look at words that discriminate among documents.

Stop words: if all documents contain "the", are all documents similar?

Common words: if most documents contain "car" then "car" doesn't tell us much about (contextual) similarity.
-
Context matters

[Figure: two document collections, Car Reviews vs. General News, with one marker for documents that contain "car" and another for documents that do not.]
-
Document Frequency

Idea: de-weight common words.

Common = appears in many documents.

document frequency = fraction of docs containing the term

df(t, D) = |{d ∈ D : t ∈ d}| / |D|
-
Inverse Document Frequency

Invert (so more common = smaller weight) and take the log:

idf(t, D) = log( |D| / |{d ∈ D : t ∈ d}| )
-
TF-IDF

Multiply term frequency by inverse document frequency.

n(t, d) = number of times term t appears in doc d
n(t, D) = number of docs in D containing t

tfidf(t, d, D) = tf(t, d) · idf(t, D)
               = n(t, d) · log( |D| / n(t, D) )
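The formula translates directly into code. A minimal sketch (the corpus below is made up for illustration):

```python
import math
from collections import Counter

def tfidf_vector(doc, corpus):
    """TF-IDF scores for one tokenized doc against a corpus of tokenized docs."""
    tf = Counter(doc)          # n(t, d)
    n_docs = len(corpus)       # |D|
    scores = {}
    for term, count in tf.items():
        # n(t, D): number of docs in the corpus containing this term
        doc_freq = sum(1 for d in corpus if term in d)
        scores[term] = count * math.log(n_docs / doc_freq)
    return scores

corpus = [
    ["the", "car", "runs", "fast"],
    ["the", "car", "is", "old"],
    ["the", "report", "is", "late"],
]
print(tfidf_vector(corpus[0], corpus))
# "the" appears in all 3 docs, so its weight is log(3/3) = 0;
# "runs" and "fast" appear in only 1 doc each, so they get the full log(3/1).
```

This is exactly the de-weighting we wanted: ubiquitous words score zero, rare words score high.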
-
TF-IDF depends on the entire corpus

The TF-IDF vector for a document changes if we add another document to the corpus.

TF-IDF is sensitive to context. The context is all other documents.

tfidf(t, d, D) = tf(t, d) · idf(t, D)

If we add a document, D changes!
-
What is this document "about"?

Each document is now a vector of TF-IDF scores for every word in the document. We can look at which words have the top scores.

crimes       0.0676
cruelty      0.0586
crime        0.0258
reporting    0.0209
animals      0.0179
michael      0.0157
category     0.0155
commit       0.0137
criminal     0.0134
societal     0.0124
trends       0.0120
conviction   0.0116
patterns     0.0112
-
Salton's description of TF-IDF

- from Salton, Wong, Yang, A Vector Space Model for Automatic Indexing, 1975
-
TF vs. TF-IDF

nj-senator-menendez corpus, Overview sample files
color = human tags generated from TF-IDF clusters
-
Cluster Hypothesis

"documents in the same cluster behave similarly with respect to relevance to information needs"

- Manning, Raghavan, Schütze, Introduction to Information Retrieval

Not really a precise statement, but the crucial link between human semantics and mathematical properties.

Articulated as early as 1971; has been shown to hold at web scale; widely assumed.
-
Bag of words + TF-IDF hard to beat

Practical win: good precision-recall metrics in tests with human-tagged document sets.

Still the dominant text indexing scheme used today (Lucene, FAST, Google). Many variants.

Some, but not much, theory to explain why this works. (E.g. why that particular idf formula? Why doesn't indexing bigrams improve performance?)

Collectively: the vector space document model.
-
Lecture 2: Text Analysis

What is text analysis?
Vector Space Model and Cosine Distance
TF-IDF
Topic Models
-
Problem Statement

Can the computer tell us the topics in a document set? Can the computer organize the documents by topic?

Note: TF-IDF tells us the topics of a single document, but here we want the topics of an entire document set.
-
Simplest possible technique

Sum TF-IDF scores for each word across the entire document set, choose top-ranking words.

This is how Overview generates cluster descriptions. It will also be your first homework assignment.
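A self-contained sketch of this summing idea (illustrative only; this is not Overview's actual code, and the toy corpus is made up):

```python
import math
from collections import Counter

def top_terms(corpus, k=5):
    """Sum TF-IDF over all docs in the tokenized corpus; return the top k terms."""
    n_docs = len(corpus)
    doc_freq = Counter(t for doc in corpus for t in set(doc))  # n(t, D)
    totals = Counter()
    for doc in corpus:
        for term, count in Counter(doc).items():
            totals[term] += count * math.log(n_docs / doc_freq[term])
    return [term for term, score in totals.most_common(k)]

corpus = [
    ["animal", "cruelty", "crimes", "the", "the"],
    ["animal", "cruelty", "reporting", "the"],
    ["budget", "vote", "the"],
]
print(top_terms(corpus, 3))
# "the" sums to 0 (it is in every doc) and never appears in the top terms
```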
-
Topic Modeling Algorithms

Basic idea: reduce the dimensionality of the document vector space, so each dimension is a topic.

Each document is then a vector of topic weights. We want to figure out what dimensions and weights give a good approximation of the full set of words in each document.

Many variants: LSI, PLSI, LDA, NMF
-
Matrix Factorization

Approximate the term-document matrix V as the product of two lower-rank matrices:

V ≈ W H

V: m docs by n terms
W: m docs by r "topics"
H: r "topics" by n terms
-
Matrix Factorization

A "topic" is a group of words that occur together.

[Figure: a row of the topic matrix H, showing the pattern of words in this topic.]
-
Non-negative Matrix Factorization

All elements of the document coordinate matrix W and the topic matrix H must be ≥ 0.

Simple iterative algorithm to compute.

Still have to choose the number of topics r.
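A sketch of the factorization using scikit-learn (a library choice of mine, not prescribed by the course; the tiny matrix is made up to show two obvious word groups):

```python
import numpy as np
from sklearn.decomposition import NMF

# Term-document matrix V: 4 docs x 5 terms
V = np.array([
    [3, 2, 0, 0, 1],
    [2, 3, 0, 1, 0],
    [0, 0, 3, 2, 0],
    [0, 1, 2, 3, 0],
])

model = NMF(n_components=2, init="nndsvd", random_state=0)  # choose r = 2 topics
W = model.fit_transform(V)   # 4 x 2: document coordinates in topic space
H = model.components_        # 2 x 5: each row is a topic's word weights

print(W.shape, H.shape)  # (4, 2) (2, 5)
print(np.round(W @ H, 1))  # approximately reconstructs V
```

Every entry of W and H comes out non-negative, so topics read as additive groups of words rather than cancelling combinations.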
-
Latent Dirichlet Allocation

Imagine that each document is written by someone going through the following process:

1. For each doc d, choose a mixture of topics p(z|d)
2. For each word w in d, choose a topic z from p(z|d)
3. Then choose the word from p(w|z)

A document has a distribution of topics.
Each topic is a distribution of words.
LDA tries to find these two sets of distributions.
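The generative process above can be simulated directly. A toy sketch with invented distributions (real LDA works backwards: it infers these distributions from the documents, it does not know them in advance):

```python
import random

random.seed(0)

# Step 1 -- p(z|d): this document's mixture of two topics
p_topic_given_doc = {"sports": 0.7, "politics": 0.3}

# p(w|z): each topic is a distribution over words
p_word_given_topic = {
    "sports":   {"game": 0.5, "team": 0.4, "vote": 0.1},
    "politics": {"vote": 0.6, "bill": 0.3, "game": 0.1},
}

def generate_word():
    # Step 2: choose a topic z from p(z|d)
    z = random.choices(list(p_topic_given_doc),
                       weights=p_topic_given_doc.values())[0]
    # Step 3: choose a word from p(w|z)
    words = p_word_given_topic[z]
    return random.choices(list(words), weights=words.values())[0]

doc = [generate_word() for _ in range(10)]
print(doc)
```

Note that "vote" and "game" appear under both topics with different probabilities: a word can belong to many topics, but each generated occurrence came from exactly one.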
-
"Documents"

LDA models each document as a distribution over topics. Each word belongs to a single topic.
-
"Topics"

LDA models a topic as a distribution over all the words in the corpus. In each topic, some words are more likely, some are less likely.
-
Dimensionality reduction

Output of NMF and LDA is a vector of much lower dimension for each document. ("Document coordinates in topic space.")

Dimensions are concepts or topics instead of words.

Can measure cosine distance, cluster, etc. in this new space.
-