getting from raw data to content analysis with custom programs

Upload: terry-college-of-business

Post on 02-Jun-2018

214 views

Category:

Documents


0 download

TRANSCRIPT

  • 8/10/2019 Getting from Raw Data to Content Analysis with Custom Programs

    1/13

    GETTING FROM RAW DATATO CONTENT ANALYSISWITH CUSTOM PROGRAMS

    JASON T. KILEY

    UNIVERSITY OF GEORGIA

  • 8/10/2019 Getting from Raw Data to Content Analysis with Custom Programs

    2/13

    Overview

    Goal: high-level familiarization with data issues

    where programming (Python 3.x) can help.

    Retrieve the structure from semi-structured text.

    Identify texts of interest within a larger set.

    Output data in appropriate forms for content

    analysis.

  • 8/10/2019 Getting from Raw Data to Content Analysis with Custom Programs

    3/13

    Semi-structured text

    Defined: Text that has a structure that cannot be

    used directly.

    Convert that semi-structure to a form we can use.

    Examples: LexisNexis (press releases, news), web

    pages, recent EDGAR documents.

    Isolate the text we want to analyze, and use the

    metadata.

  • 8/10/2019 Getting from Raw Data to Content Analysis with Custom Programs

    4/13

    PRESS RELEASE 1

  • 8/10/2019 Getting from Raw Data to Content Analysis with Custom Programs

    5/13

    PRESS RELEASE 1 & 2

  • 8/10/2019 Getting from Raw Data to Content Analysis with Custom Programs

    6/13

    Regular expressions

    Defined: a way of specifying search patterns.

    Any given regex has a binary outcome dependingon whether it matches: true or false.

    In some cases, this true or false result is enough.

    We can also specify that we want to capture a part

    of the matching text to use.

  • 8/10/2019 Getting from Raw Data to Content Analysis with Custom Programs

    7/13

    Identifying interesting texts

    Archival raw data is often full of texts that we are

    not interested in.

    Some data is naturally noisy.

    Other times, over-gathering is easier if we can

    automate the elimination of uninteresting texts.

    Human labor is smart but expensive and slow;

    computers are fast but only as smart as you can

    make it.

  • 8/10/2019 Getting from Raw Data to Content Analysis with Custom Programs

    8/13

    REGULAR EXPRESSIONS

  • 8/10/2019 Getting from Raw Data to Content Analysis with Custom Programs

    9/13

    Output

    Forms

    Individual files: LIWC, comparisons.

    Spreadsheets: coding, statistical analysis.

    Testing and project management: gatherdescriptive statistics, diagnostics.

  • 8/10/2019 Getting from Raw Data to Content Analysis with Custom Programs

    10/13

    OUTPUT: INDIVIDUAL FILES

  • 8/10/2019 Getting from Raw Data to Content Analysis with Custom Programs

    11/13

    OUTPUT: SPREADSHEET

  • 8/10/2019 Getting from Raw Data to Content Analysis with Custom Programs

    12/13

    Scratching the surface

    If you can precisely describe how to do something,

    even if that explanation is complex, a computer can

    do the hard work.

    Automate raw data gathering.

    Use open source libraries for analysis.

    Write your own algorithms to analyze content.

  • 8/10/2019 Getting from Raw Data to Content Analysis with Custom Programs

    13/13

    COMMENTS ANDQUESTIONS