Computational Journalism at Columbia, Fall 2013: Lecture 1, Basics

Upload: jonathan-stray

Post on 14-Apr-2018


TRANSCRIPT

  • 7/29/2019 Computational Journalism at Columbia, Fall 2013: Lecture 1, Basics

    1/56

    Frontiers of Computational Journalism

    Columbia Journalism School

    Week 1: Basics

    September 4, 2013


    2/56

    Lecture 1: Basics

    Computer Science and Journalism

    Representing Data

    Interpreting High Dimensional Data


    3/56

    Computational Journalism: Definitions

    "Broadly defined, it can involve changing how stories are discovered, presented, aggregated, monetized, and archived. Computation can advance journalism by drawing on innovations in topic detection, video analysis, personalization, aggregation, visualization, and sensemaking."

    - Cohen, Hamilton, Turner, Computational Journalism


    4/56

    Computational Journalism: Definitions

    "Stories will emerge from stacks of financial disclosure forms, court records, legislative hearings, officials' calendars or meeting notes, and regulators' email messages that no one today has time or money to mine. With a suite of reporting tools, a journalist will be able to scan, transcribe, analyze, and visualize the patterns in these documents."

    - Cohen, Hamilton, Turner, Computational Journalism


    5/56

    Cohen et al. model

    [Diagram labels: Data, Reporting, User, Computer Science]


    11/56

    CS for presentation / interaction

    [Diagram labels: Data, Reporting, User, CS, CS]


    15/56

    Filter many stories for user

    [Diagram labels: Data, Reporting, CS (repeated for three stories); Filtering; User; CS]


    16/56

    Examples of filters

    What an editor puts on the front page
    Google News
    Reddit's comment system
    Twitter
    Facebook news feed
    Techmeme


    17/56

    Memetracker by Leskovec, Backstrom, Kleinberg


    18/56

    Kony 2012 early network, by Gilad Lotan / SocialFlow


    19/56

    Track effects

    [Diagram labels: Data, Reporting, CS (repeated for three stories); Filtering; User; Effects; CS]


    20/56

    Computer Science in Journalism

    Reporting
    Presentation
    Filtering
    Tracking


    21/56

    Computational Journalism: Definitions

    "the application of computer science to the problems of public information, knowledge, and belief, by practitioners who see their mission as outside of both commerce and government."

    - Jonathan Stray, A Computational Journalism Reading List


    22/56

    Course Structure

    Information retrieval: TF-IDF, search engines
    Text analysis: clustering and topic modeling
    Information filtering systems
    Social network analysis
    Knowledge representation
    Drawing conclusions from data
    Information security
    Tracking flow and effects


    23/56

    [Diagram labels: Natural Language Processing, Data Science, Sociology, Artificial Intelligence, Cognitive Science, Statistics, Graph Theory, Clustering, Text Analysis, Filter Design, Social Network Analysis, Knowledge Representation, Drawing Conclusions, Information Retrieval]


    24/56

    Administration

    Assignment after each class

    Four assignments require programming, but your writing counts for more than your code!

    Course blog

    http://jmsc.hku.hk/courses/jmsc6041spring2013/

    Final project

    to be completed Feb-April


    25/56

    Lecture 1: Basics

    Computer Science and Journalism

    Representing Data

    Interpreting High Dimensional Data


    26/56

    Definition of data

    a collection of similar pieces of information


    29/56

    structured data


    30/56

    unstructured data


    32/56

    Vector representation of objects

    Fundamental representation for (almost) all data mining, clustering, machine learning, visualization, NLP, etc. algorithms.

    \[ \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_N \end{bmatrix} \]


    33/56

    Each x_i is a numerical or categorical "feature"

    N = number of features or "dimension"

    \[ \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_N \end{bmatrix} \]


    34/56

    Examples of features

    number of claws
    latitude
    color {red, yellow, blue}
    number of break-ins
    1 for bought X, 0 for did not buy X
    time, duration, etc.
    number of times word Y appears in document
    votes cast
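
One way to make the feature list above concrete is to encode a single object as a vector. This is only a sketch: the record, its field names, and the choice of one-hot encoding for the color feature are assumptions for illustration, not something the lecture specifies.

```python
# Hypothetical record describing one object; field names are illustrative only.
record = {"claws": 4, "latitude": 40.8075, "color": "yellow", "bought_x": True}

COLORS = ["red", "yellow", "blue"]  # the categorical values listed on the slide


def to_vector(rec):
    """Turn a record into a flat list of numbers, one entry per feature."""
    # One-hot encode the categorical feature: 1.0 in the slot of the matching color.
    one_hot = [1.0 if rec["color"] == c else 0.0 for c in COLORS]
    return [
        float(rec["claws"]),              # numeric count
        float(rec["latitude"]),           # numeric, continuous
        *one_hot,                         # categorical, expanded to three 0/1 features
        1.0 if rec["bought_x"] else 0.0,  # 1 for bought X, 0 for did not buy X
    ]


x = to_vector(record)
print(x)       # the feature vector
print(len(x))  # N, the number of features (the dimension)
```

Note that one categorical feature became three numeric ones here, so the dimension N depends on how the world is encoded, not only on how many things were measured.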


    35/56

    Feature selection

    Technical meaning in machine learning etc.:

    which variables matter?

    We're journalists, so we're interested in an earlier process:

    how to describe the world in numbers?


    36/56

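    Choosing Features

    \[ \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_N \end{bmatrix} \rightarrow \begin{bmatrix} x_{f(1)} \\ x_{f(2)} \\ \vdots \\ x_{f(k)} \end{bmatrix} \quad \text{where } k < N \]

    Journalism: How do we represent the world numerically?

    Machine learning: Which variables carry the most information?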
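
The machine-learning sense of choosing features can be sketched as picking k of the N columns of a data matrix. The example below uses per-column variance as a crude stand-in for "carries the most information"; that criterion, the function name `select_top_variance`, and the toy matrix are all assumptions made for illustration, and real feature selection often uses other measures.

```python
import numpy as np


def select_top_variance(X, k):
    """Return (column indices, reduced matrix) for the k highest-variance columns."""
    variances = X.var(axis=0)
    # argsort is ascending, so the last k entries are the largest variances;
    # sorting them again restores the original column order.
    keep = np.sort(np.argsort(variances)[-k:])
    return keep, X[:, keep]


# Toy data: 4 objects (rows) described by N = 4 features (columns).
X = np.array([
    [1.0, 5.0, 0.0, 10.0],
    [1.0, 6.0, 0.1, 20.0],
    [1.0, 7.0, 0.0, 30.0],
    [1.0, 8.0, 0.1, 40.0],
])

keep, X_small = select_top_variance(X, k=2)
print(keep)           # indices f(1), ..., f(k) of the kept features
print(X_small.shape)  # same objects, fewer dimensions
```

The first column is constant, so it cannot distinguish any two objects and is dropped. Variance depends on units, though, which is one reason this is only a rough heuristic.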


    37/56

    Different types of quantitative

    Numeric
      continuous
      countable
      bounded?
      units of measurement?

    Categorical
      finite, e.g. {on, off}
      infinite, e.g. {red, yellow, blue, ... chartreuse}
      ordered?
      equivalence classes or other structure?


    38/56

    Different types of scales

    Temperature

    Continuous scale, fixed zero point, physical units, comparative, uniform

    Likert Scale

    Discrete scale, no fixed origin, abstract units, comparative, non-uniform


    39/56

    Likert scales are non-uniform


    40/56

    No averages on a non-uniform scale

    It's not linear, so is 2X1 twice as good?

    (X1+c)(X2+c)X1X2

    Lots of things don't make much sense, such as

    sum(X1...XN)/N = ?

    Average is not well defined! (Nor stddev, etc.)

    But rank order statistics are robust.

    And all of this might not be a problem in practice.
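
A small sketch of why the average is fragile on a non-uniform scale while a rank order statistic is not. The two response groups and the alternative coding (5 relabeled as 10) are made up for illustration; the point is only that any order-preserving relabeling of the answers is equally defensible when the spacing between answers is unknown.

```python
from statistics import mean, median

# Hypothetical Likert responses (1 = strongly disagree ... 5 = strongly agree).
group_a = [3, 3, 3, 3, 3]
group_b = [1, 1, 2, 5, 5]

# A different coding of the same answers: same order, different spacing.
recode = {1: 1, 2: 2, 3: 3, 4: 4, 5: 10}
group_a_recoded = [recode[r] for r in group_a]
group_b_recoded = [recode[r] for r in group_b]

# The comparison of means flips depending on the coding...
print(mean(group_a) > mean(group_b))
print(mean(group_a_recoded) > mean(group_b_recoded))

# ...while the comparison of medians does not.
print(median(group_a) > median(group_b))
print(median(group_a_recoded) > median(group_b_recoded))
```

The median only depends on the order of the answers, so any coding that preserves that order gives the same comparison.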


    41/56

    Other issues with quantitative

    Where did the data come from?
      physical measurement
      computer logging
      human recording

    What are the sources of error?
      measurement error
      missing data
      ambiguity in human classification
      process errors
      intentional bias / deception


    42/56

    Even with all these caveats, the vector representation is incredibly flexible and powerful.

    \[ \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_N \end{bmatrix} \]


    43/56

    Examples of vector representations

    Obvious
      movies watched / items purchased
      legislative voting history for a politician
      crime locations

    Less obvious, but standard
      document vector space model
      psychological survey results

    Tricky research problem: disparate field types
      corporate filing document
      Wikileaks SIGACT
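
The document vector space model mentioned above can be sketched as a word-count vector: one dimension per vocabulary word, with the count of that word as the value. The two sentences, the tokenizer, and the helper names are assumptions for illustration; practical systems usually add weighting such as the TF-IDF listed in the course structure.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


# Two made-up documents.
docs = [
    "the court records show the payments",
    "the calendars show the meetings",
]

# The vocabulary fixes which word each dimension stands for.
vocab = sorted({word for doc in docs for word in tokenize(doc)})


def doc_vector(text):
    """Count how many times each vocabulary word appears in the text."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocab]


vectors = [doc_vector(doc) for doc in docs]
print(vocab)
for vec in vectors:
    print(vec)
```

Every document becomes a vector of the same length, which is what lets the same clustering and retrieval algorithms run on text.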


    44/56

    What can we do with vectors?

    Predict one variable based on others
      this is called "regression"
      supervised machine learning

    Group similar items together
      This is classification or clustering
      We may or may not know pre-existing classes
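
As a minimal sketch of "predict one variable based on others", the example below fits a straight line by least squares with NumPy. The data are synthetic (generated from y = 2x + 1 with no noise) and chosen only so the recovered coefficients are easy to check; nothing here comes from the lecture's own examples.

```python
import numpy as np

# Synthetic data: one input feature x and one target y, with y = 2x + 1.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

# Design matrix with a column of ones so the fit includes an intercept.
A = np.column_stack([x, np.ones_like(x)])

# Least-squares solution of A @ [slope, intercept] = y.
(slope, intercept), *_ = np.linalg.lstsq(A, y, rcond=None)

# Use the fitted line to predict y for a new x.
prediction = slope * 5.0 + intercept
print(slope, intercept, prediction)
```

With more input features, the design matrix simply gains more columns, one per feature of the vector.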


    46/56

    Interpreting High Dimensional Data

    UK House of Lords voting record, 2000-2012.

    N = 1043 votes by M = 1630 lords

    2 = aye, 4 = nay, -9 = didn't vote


    47/56

    Vote vectors

    Let v(i,j) = vote of MP i on issue j. Then we can look at all votes for a particular MP:

    \[ \mathrm{mp}_i = \begin{bmatrix} v(i,0) & v(i,1) & \cdots & v(i,N-1) \end{bmatrix} \]

    Now we have 1630 vectors, each of dimension 1043.

    What could we learn from this? What is their structure?
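
A sketch of the vote matrix as a NumPy array, with one row per member and one column per vote. The tiny matrix is made up. The recoding of the raw codes to +1 (aye), -1 (nay) and 0 (didn't vote) is an assumption added here, since 2, 4 and -9 are labels rather than quantities and would distort any distance or variance computed on them directly.

```python
import numpy as np

AYE, NAY, ABSENT = 2, 4, -9  # raw codes as given in the data description

# Made-up example: 3 members (rows, index i) by 4 votes (columns, index j).
v = np.array([
    [2, 2, 4, -9],
    [2, 4, 4, 2],
    [4, 4, 2, -9],
])


def recode(raw):
    """Map raw codes to +1 (aye), -1 (nay), 0 (didn't vote). Assumed coding."""
    out = np.zeros(raw.shape, dtype=float)
    out[raw == AYE] = 1.0
    out[raw == NAY] = -1.0
    return out


votes = recode(v)
mp_0 = votes[0]  # the vote vector for member i = 0

print(votes.shape)  # (members, votes)
print(mp_0)
```

Each row is one vote vector, so the full data set would be a 1630 by 1043 array under the counts given in the lecture.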


    48/56

    Visualizing High Dimensional Data

    We can visualize 3 dimensions at a time.

    What do we do with 1043?


    49/56

    Looking at all MPs for votes 100, 200, 300


    50/56

    Dimensionality reduction

    Problem: vector space is high-dimensional. Up to thousands of dimensions. The screen is two-dimensional.

    We have to go from

    \[ x \in \mathbb{R}^N \]

    to much lower dimensional points

    \[ y \in \mathbb{R}^K \]


    51/56

    This is called "projection"

    Projection from 3 to 2 dimensions


    52/56

    Think of this as rotating to align the "screen" with coordinate axes, then simply throwing out values of higher dimensions.

    Projection from 3 to 2 dimensions
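
The rotate-then-drop description can be sketched directly: rotate the points, then keep only the first two coordinates. The rotation about the y-axis, the function names, and the two sample points are assumptions chosen to keep the example small.

```python
import numpy as np


def rotate_y(theta):
    """3x3 rotation matrix for an angle theta (radians) about the y-axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([
        [c, 0.0, s],
        [0.0, 1.0, 0.0],
        [-s, 0.0, c],
    ])


def project(points, theta):
    """Rotate 3D points, then throw away the third coordinate."""
    rotated = points @ rotate_y(theta).T
    return rotated[:, :2]


# Two points that differ only along the z-axis.
points = np.array([
    [0.0, 0.0, 1.0],
    [0.0, 0.0, 2.0],
])

head_on = project(points, 0.0)        # looking straight down the z-axis
side_on = project(points, np.pi / 2)  # after a quarter turn

print(head_on)  # both points land on the same 2D location
print(side_on)  # the points are now separated on screen
```

Viewed along the z-axis the two points are indistinguishable, while a quarter turn separates them, so the choice of viewing direction decides what structure survives the projection.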


    53/56

    Direction of projection matters!


    54/56

    Which direction should we look from?

    Intuition: find a direction that "spreads out" points.


    55/56

    House of Lords PCA analysis

    Principal Components Analysis finds the directions of maximum variance. Here, we're plotting the two dims of greatest variance.
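
A minimal sketch of PCA computed with a singular value decomposition in NumPy. It runs on synthetic random data, not the House of Lords votes, and the function name `pca_project` is hypothetical; the lecture does not say which tool produced its plot.

```python
import numpy as np


def pca_project(X, k=2):
    """Project rows of X onto the k directions of greatest variance."""
    # Center each feature so variance is measured around the mean.
    Xc = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing variance.
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:k]
    return Xc @ components.T, components


# Synthetic data: 200 points in 5 dimensions, stretched most along axis 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) * np.array([5.0, 2.0, 0.5, 0.5, 0.5])

Y, components = pca_project(X, k=2)

print(Y.shape)         # 200 points, now 2-dimensional
print(Y.var(axis=0))   # variance along each of the two kept directions
print(components[0])   # first principal direction, dominated by axis 0
```

Plotting the two columns of `Y` against each other gives the kind of 2D scatter described above. The sign of each component is arbitrary, so the plot may come out mirrored.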


    56/56

    Interpretation requires context

    Conservative and Liberal Democrats really do vote together, mostly

    Cross-benchers and bishops in the middle

    Labour opposite