Getting from Raw Data to Content Analysis with Custom Programs

download Getting from Raw Data to Content Analysis with Custom Programs

of 13

Transcript of Getting from Raw Data to Content Analysis with Custom Programs

  • 8/10/2019 Getting from Raw Data to Content Analysis with Custom Programs

    1/13

    GETTING FROM RAW DATATO CONTENT ANALYSISWITH CUSTOM PROGRAMS

    JASON T. KILEY

    UNIVERSITY OF GEORGIA

  • 8/10/2019 Getting from Raw Data to Content Analysis with Custom Programs

    2/13

    Overview

    Goal: high-level familiarization with data issues

    where programming (Python 3.x) can help.

    Retrieve the structure from semi-structured text.

    Identify texts of interest within a larger set.

    Output data in appropriate forms for content

    analysis.

  • 8/10/2019 Getting from Raw Data to Content Analysis with Custom Programs

    3/13

    Semi-structured text

    Defined: Text that has a structure that cannot be

    used directly.

    Convert that semi-structure to a form we can use.

    Examples: LexisNexis (press releases, news), web

    pages, recent EDGAR documents.

    Isolate the text we want to analyze, and use the

    metadata.

  • 8/10/2019 Getting from Raw Data to Content Analysis with Custom Programs

    4/13

    PRESS RELEASE 1

  • 8/10/2019 Getting from Raw Data to Content Analysis with Custom Programs

    5/13

    PRESS RELEASE 1 & 2

  • 8/10/2019 Getting from Raw Data to Content Analysis with Custom Programs

    6/13

    Regular expressions

    Defined: a way of specifying search patterns.

    Any given regex has a binary outcome dependingon whether it matches: true or false.

    In some cases, this true or false result is enough.

    We can also specify that we want to capture a part

    of the matching text to use.

  • 8/10/2019 Getting from Raw Data to Content Analysis with Custom Programs

    7/13

    Identifying interesting texts

    Archival raw data is often full of texts that we are

    not interested in.

    Some data is naturally noisy.

    Other times, over-gathering is easier if we can

    automate the elimination of uninteresting texts.

    Human labor is smart but expensive and slow;

    computers are fast but only as smart as you can

    make it.

  • 8/10/2019 Getting from Raw Data to Content Analysis with Custom Programs

    8/13

    REGULAR EXPRESSIONS

  • 8/10/2019 Getting from Raw Data to Content Analysis with Custom Programs

    9/13

    Output

    Forms

    Individual files: LIWC, comparisons.

    Spreadsheets: coding, statistical analysis.

    Testing and project management: gatherdescriptive statistics, diagnostics.

  • 8/10/2019 Getting from Raw Data to Content Analysis with Custom Programs

    10/13

    OUTPUT: INDIVIDUAL FILES

  • 8/10/2019 Getting from Raw Data to Content Analysis with Custom Programs

    11/13

    OUTPUT: SPREADSHEET

  • 8/10/2019 Getting from Raw Data to Content Analysis with Custom Programs

    12/13

    Scratching the surface

    If you can precisely describe how to do something,

    even if that explanation is complex, a computer can

    do the hard work.

    Automate raw data gathering.

    Use open source libraries for analysis.

    Write your own algorithms to analyze content.

  • 8/10/2019 Getting from Raw Data to Content Analysis with Custom Programs

    13/13

    COMMENTS ANDQUESTIONS