Computational Journalism at Columbia, Fall 2013: Lecture 2, Text Analysis



Frontiers of Computational Journalism

Columbia Journalism School

Week 2: Text Analysis

September 11, 2013


Lecture 2: Text Analysis

What is text analysis?
Vector Space Model and Cosine Distance
TF-IDF
Topic Models


Basic idea: quantitative information can tell stories

"When Hu Jintao came to power in 2002, China was already experiencing a worsening social crisis. In 2004, President Hu offered a rhetorical response to growing internal instability, trumpeting what he called a 'harmonious society.' For some time, this new watchword burgeoned, becoming visible everywhere in the Party's propaganda."

- Qian Gang, Watchwords: Reading China through its Party Vocabulary


"But by 2007 it was already on the decline, as 'stability preservation' made its rapid ascent. ... Together, these contrasting pictures of the 'harmonious society' and 'stability preservation' form a portrait of the real predicament facing President Hu Jintao. A harmonious society may be a pleasing idea, but it's the iron will behind stability preservation that packs the real punch."

- Qian Gang, Watchwords: Reading China through its Party Vocabulary


Google ngrams viewer: 12% of all books ever published


Data can give a wider view

"Let me talk about Downton Abbey for a minute. The show's popularity has led many nitpickers to draft up lists of mistakes. ... But all of these have relied, so far as I can tell, on finding a phrase or two that sounds a bit off, and checking the online sources for earliest use.

I lack such social graces. So I thought: why not just check every single line in the show for historical accuracy? ... So I found some copies of the Downton Abbey scripts online, and fed every single two-word phrase through the Google Ngram database to see how characteristic of the English language, c. 1917, Downton Abbey really is."

- Ben Schmidt, Making Downton more traditional


Bigrams that do not appear in English books between 1912 and 1921.


Bigrams that are at least 100 times more common today than they were in 1912-1921.


Lecture 2: Text Analysis

What is text analysis?
Vector Space Model and Cosine Distance
TF-IDF
Topic Models


Documents, not words

We can use clustering techniques if we can convert documents into vectors.

As before, we want to find numerical features that describe the document.

How do we capture the meaning of a document in numbers?


What is this document "about"?

The most commonly occurring words are a pretty good indicator:

30  the
23  to
19  and
19  a
18  animal
17  cruelty
15  of
15  crimes
14  in
14  for
11  that
 8  crime
 7  we


Turns out features = words works fine

Encode each document as the list of words it contains.

Dimensions = vocabulary of the document set.

Value on each dimension = number of times the word appears in the document.


Example

D1 = "I like databases"
D2 = "I hate hate databases"

Each row = document vector
All rows = term-document matrix
Individual entry = tf(t, d) = term frequency
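A quick sketch of this example in Python (stdlib only; the variable names are mine):

    from collections import Counter

    docs = ["I like databases", "I hate hate databases"]
    tokenized = [doc.lower().split() for doc in docs]

    # dimensions = vocabulary of the document set
    vocab = sorted(set(word for doc in tokenized for word in doc))

    # each row is a document vector; all rows form the term-document matrix
    matrix = [[Counter(doc)[term] for term in vocab] for doc in tokenized]

    print(vocab)   # ['databases', 'hate', 'i', 'like']
    print(matrix)  # [[1, 0, 1, 1], [1, 2, 1, 0]]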


Aka the "bag of words" model

Throws out word order.

E.g. "soldiers shot civilians" and "civilians shot soldiers" are encoded identically.


Tokenization

The documents come to us as long strings, not individual words. Tokenization is the process of converting the string into individual words, or "tokens."

For this course, we will assume a very simple strategy:

- convert all letters to lowercase
- remove all punctuation characters
- separate words based on spaces

Note that this won't work at all for Chinese. It will fail in some ways even for English. How?
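In Python, this strategy is only a few lines (a sketch; the tokenize helper is my naming):

    import string

    def tokenize(text):
        """Lowercase, strip punctuation, split on spaces."""
        text = text.lower()
        text = text.translate(str.maketrans("", "", string.punctuation))
        return text.split()

    tokenize("I hate, hate databases!")  # ['i', 'hate', 'hate', 'databases']

Stripping punctuation already shows one English failure mode: "don't" becomes "dont", and "U.S." becomes "us".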


Distance function

Useful for:

- clustering documents
- finding docs similar to an example
- matching a search query

Basic idea: look for overlapping terms.


Cosine similarity

Given document vectors a, b, define

    similarity(a, b) = a · b

If each word occurs exactly once in each document, this is equivalent to counting overlapping words.

Note: this is not a distance function, as similarity increases when documents are more similar. (What part of the definition of a distance function is violated here?)


Problem: long documents always win

Let a = "This car runs fast."
Let b = "My car is old. I want a new car, a shiny car."
Let query q = "fast car"

       this  car  runs  fast  my  is  old  I  want  a  new  shiny
    a    1    1    1     1    0   0   0   0    0   0   0     0
    b    0    3    0     0    1   1   1   1    1   1   1     1
    q    0    1    0     1    0   0   0   0    0   0   0     0


Problem: long documents always win

    similarity(a, q) = 1·1 [car] + 1·1 [fast] = 2
    similarity(b, q) = 3·1 [car] + 0·1 [fast] = 3

The longer document is more similar, by virtue of repeating words.


Normalize document vectors

    similarity(a, b) = (a · b) / (|a| |b|) = cos θ

Returns a result in [0, 1].


Normalized query example

       this  car  runs  fast  my  is  old  I  want  a  new  shiny
    a    1    1    1     1    0   0   0   0    0   0   0     0
    b    0    3    0     0    1   1   1   1    1   1   1     1
    q    0    1    0     1    0   0   0   0    0   0   0     0

    similarity(a, q) = 2 / (√4 · √2) = 1/√2 ≈ 0.707

    similarity(b, q) = 3 / (√17 · √2) ≈ 0.514
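A sketch that reproduces these numbers in Python:

    import math

    def cosine_similarity(a, b):
        """Dot product divided by the product of the vector lengths."""
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(x * x for x in b))
        return dot / (norm_a * norm_b)

    #    this car runs fast my is old  I want  a new shiny
    a = [1,   1,  1,   1,   0, 0,  0,  0,  0,  0,  0,  0]
    b = [0,   3,  0,   0,   1, 1,  1,  1,  1,  1,  1,  1]
    q = [0,   1,  0,   1,   0, 0,  0,  0,  0,  0,  0,  0]

    print(cosine_similarity(a, q))  # 0.7071...
    print(cosine_similarity(b, q))  # 0.5144...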


Cosine similarity

    cos θ = similarity(a, b) = (a · b) / (|a| |b|)


Cosine distance (finally)

    dist(a, b) = 1 - (a · b) / (|a| |b|)


Lecture 2: Text Analysis

What is text analysis?
Vector Space Model and Cosine Distance
TF-IDF
Topic Models


Problem: common words

We want to look at words that discriminate among documents.

Stop words: if all documents contain "the", are all documents similar?

Common words: if most documents contain "car", then "car" doesn't tell us much about (contextual) similarity.


Context matters

[Figure: two document sets, Car Reviews and General News; filled dots = contains "car", open dots = does not contain "car"]


Document frequency

Idea: de-weight common words. Common = appears in many documents.

document frequency = fraction of docs containing the term

    df(t, D) = |{d ∈ D : t ∈ d}| / |D|


Inverse document frequency

Invert (so more common = smaller weight) and take the log:

    idf(t, D) = log( |D| / |{d ∈ D : t ∈ d}| )


TF-IDF

Multiply term frequency by inverse document frequency.

n(t, d) = number of times term t appears in doc d
n(t, D) = number of docs in D containing t

    tfidf(t, d, D) = tf(t, d) · idf(t, D)
                   = n(t, d) · log( |D| / n(t, D) )
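A minimal sketch of this formula in Python (the tf_idf helper is my naming; it assumes each document is already a list of tokens):

    import math
    from collections import Counter

    def tf_idf(docs):
        """docs: list of token lists. Returns one {term: score} dict per doc."""
        N = len(docs)
        n_t_D = Counter(term for doc in docs for term in set(doc))  # n(t, D)
        return [
            {term: n * math.log(N / n_t_D[term])  # n(t, d) * log(|D| / n(t, D))
             for term, n in Counter(doc).items()}
            for doc in docs
        ]

Note that a term appearing in every document gets idf = log(1) = 0: exactly the de-weighting of common words we wanted.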


TF-IDF depends on the entire corpus

The TF-IDF vector for a document changes if we add another document to the corpus.

TF-IDF is sensitive to context, and the context is all other documents:

    tfidf(t, d, D) = tf(t, d) · idf(t, D)

If we add a document, D changes!


What is this document "about"?

Each document is now a vector of TF-IDF scores for every word in the document. We can look at which words have the top scores.

crimes      0.0676
cruelty     0.0586
crime       0.0258
reporting   0.0209
animals     0.0179
michael     0.0157
category    0.0155
commit      0.0137
criminal    0.0134
societal    0.0124
trends      0.0120
conviction  0.0116
patterns    0.0112


Salton's description of tf-idf

- from Salton, Wong, Yang, "A Vector Space Model for Automatic Indexing," 1975


TF vs. TF-IDF

[Figure: nj-senator-menendez corpus, Overview sample files; color = human tags generated from TF-IDF clusters]


Cluster Hypothesis

"Documents in the same cluster behave similarly with respect to relevance to information needs."

- Manning, Raghavan, Schütze, Introduction to Information Retrieval

Not really a precise statement, but it is the crucial link between human semantics and mathematical properties. Articulated as early as 1971, it has been shown to hold at web scale, and it is widely assumed.


Bag of words + TF-IDF is hard to beat

Practical win: good precision-recall metrics in tests with human-tagged document sets.

Still the dominant text indexing scheme used today (Lucene, FAST, Google). Many variants.

Some, but not much, theory to explain why this works. (E.g., why that particular idf formula? Why doesn't indexing bigrams improve performance?)

Collectively: the vector space document model.


Lecture 2: Text Analysis

What is text analysis?
Vector Space Model and Cosine Distance
TF-IDF
Topic Models


Problem statement

Can the computer tell us the topics in a document set? Can the computer organize the documents by topic?

Note: TF-IDF tells us the topics of a single document, but here we want the topics of an entire document set.


Simplest possible technique

Sum the TF-IDF scores for each word across the entire document set, and choose the top-ranking words.

This is how Overview generates cluster descriptions. It will also be your first homework assignment.
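A sketch of the idea (reusing the tf_idf helper sketched earlier; this is an illustration, not Overview's actual code):

    from collections import Counter

    def top_corpus_words(docs, k=10):
        """Sum each term's TF-IDF score across the whole document set."""
        totals = Counter()
        for scores in tf_idf(docs):  # one {term: score} dict per document
            totals.update(scores)    # Counter.update adds the scores
        return totals.most_common(k)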


Topic modeling algorithms

Basic idea: reduce the dimensionality of the document vector space, so each dimension is a topic.

Each document is then a vector of topic weights. We want to figure out what dimensions and weights give a good approximation of the full set of words in each document.

Many variants: LSI, pLSI, LDA, NMF.


Matrix factorization

Approximate the term-document matrix V as the product of two lower-rank matrices:

    V ≈ W H

V: m docs by n terms
W: m docs by r "topics"
H: r "topics" by n terms


Matrix factorization

A "topic" is a group of words that occur together. Each row of H is the pattern of words in one topic.


Non-negative matrix factorization (NMF)

All elements of the document coordinate matrix W and the topic matrix H must be >= 0.

There is a simple iterative algorithm to compute it.

We still have to choose the number of topics r.
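As a concrete sketch, scikit-learn's NMF can factor a TF-IDF term-document matrix directly (the toy corpus and r = 2 here are made up for illustration):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import NMF

    docs = [
        "the car engine runs fast",
        "a new car and an old car",
        "the senator proposed a new law",
        "the senate passed the law",
    ]

    vectorizer = TfidfVectorizer()
    V = vectorizer.fit_transform(docs)  # m docs x n terms

    r = 2                               # number of topics, chosen by hand
    model = NMF(n_components=r)
    W = model.fit_transform(V)          # m docs x r: document coordinates in topic space
    H = model.components_               # r x n: pattern of words in each topic

    terms = vectorizer.get_feature_names_out()
    for topic in H:
        print([terms[i] for i in topic.argsort()[::-1][:4]])  # top words per topic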


Latent Dirichlet Allocation

Imagine that each document is written by someone going through the following process:

1. For each doc d, choose a mixture of topics p(z|d)
2. For each word w in d, choose a topic z from p(z|d)
3. Then choose the word from p(w|z)

A document has a distribution over topics. Each topic is a distribution over words. LDA tries to find these two sets of distributions.
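A sketch with scikit-learn's LatentDirichletAllocation, reusing the toy docs list from the NMF sketch (LDA works on raw term counts, not TF-IDF):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    counts = CountVectorizer().fit_transform(docs)  # raw term counts

    lda = LatentDirichletAllocation(n_components=2)
    doc_topics = lda.fit_transform(counts)  # rows ~ p(z|d): each document's topic mixture
    word_topics = lda.components_           # rows ~ p(w|z) (unnormalized): each topic's word weights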


    "Documents"

    LDAmodelseachdocumentasadistribu%onovertopics.Each

    wordbelongstoasingletopic.


    "Topics"

    LDAmodelsatopicasadistribu%onoverallthewordsinthe

    corpus.Ineachtopic,somewordsaremorelikely,someareless

    likely.


Dimensionality reduction

The output of NMF and LDA is a vector of much lower dimension for each document ("document coordinates in topic space").

Dimensions are concepts or topics instead of words.

We can measure cosine distance, cluster, etc. in this new space.
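For instance, reusing W from the NMF sketch above, pairwise cosine distances in topic space are one call:

    from sklearn.metrics.pairwise import cosine_distances

    # W: m docs x r topics, from NMF (or LDA's doc_topics)
    dist = cosine_distances(W)  # m x m matrix of cosine distances in topic space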
