Methodological Considerations in Analyzing Twitter Data · 2015. 1. 28. · RTI International RTI...

RTI International RTI International is a trade name of Research Triangle Institute. www.rti.org Methodological Considerations in Analyzing Twitter Data Annice Kim, Heather Hansen, Joe Murphy Presentation at AAPOR Annual Conference, May 2012, Orlando, FL.


  • RTI International

    RTI International is a trade name of Research Triangle Institute. www.rti.org

    Methodological Considerations in Analyzing Twitter Data

    Annice Kim, Heather Hansen, Joe Murphy

    Presentation at AAPOR Annual Conference, May 2012, Orlando, FL.

  • RTI International

    Purpose

    In this session, we use examples from an ongoing study to illustrate methodological issues in analyzing Twitter data.

    We will discuss insights on:

    1) sampling
    2) data cleaning
    3) volume + data management
    4) metrics
    5) time frame and unit of analysis

    We will conclude with areas for future research.

  • RTI International

    Background: Twitter Growth

    Twitter began in July 2006

    [Figure: Tweets per day and registered users, in millions, January 2008 to March 2012. Chart callouts: 3 million users, 300,000 tweets/day; 140 million users, 340 million+ tweets/day.]

    Source: Twitter (http://blog.twitter.com/2012/03/twitter-turns-six.html)

  • RTI International

    Background: Impact of Twitter

    Recent studies highlight the importance of Twitter in helping researchers understand public discourse and public opinion about a wide range of topics, including health.

    “Pandemics in the age of Twitter” - Chew (2010)

    “Predicting the future with social media” - Asur & Huberman (2010)

    “From Tweets to Polls: Linking Text Sentiment to Public Opinion Time Series” - O’Connor, Balasubramanyan, Routledge, & Smith (2010)

  • RTI International

    Data Source

    Topics: salvia, ketamine, cocaine, flu

    o Relevant tweets from radian6
    o Google Insights for Search
    o Prevalence rates of drug use - NSDUH
    o Confirmed flu cases - CDC, MMWR

    More info: bit.ly/twitterNSDUH

    Can tweets and Google searches forecast trends in actual health behavior?

  • RTI International

    1) Sample Frame

    Twitter default search only goes back 1 week and cannot handle multiple-keyword searches

    Third-party sources: Application Programming Interface (API) vs. firehose access

                        API                Firehose
    Data Available      1-10%+ sample      Full sample
    Historical Data     No                 Yes (availability varies by vendor)
    Cost                Free               Varies by vendor/volume ($500+)

  • RTI International

    2) Noise/Data Cleaning

    Other, unrelated conversations may be driving your topic coverage.

    For some topics, the noise level is high (e.g., “cocaine”).

    Example: tweets about salvia vs. tweets about salvia in the “gardening” sense
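
    One simple way to screen out this kind of noise is a keyword exclusion list. The sketch below is only an illustration under stated assumptions: the exclusion terms, the function names (`is_noise`, `split_relevant`), and the sample tweets are all hypothetical and are not taken from the study, which does not describe its cleaning rules.

```python
import re

# Hypothetical exclusion terms for the gardening sense of "salvia";
# a real study would derive these from manual review of a sample of tweets.
NOISE_TERMS = {"garden", "gardening", "plant", "plants", "flower", "flowers", "bloom"}

def is_noise(text, noise_terms=NOISE_TERMS):
    """Return True if the tweet contains any exclusion term as a whole word."""
    words = set(re.findall(r"[a-z']+", text.lower()))
    return not words.isdisjoint(noise_terms)

def split_relevant(tweets, noise_terms=NOISE_TERMS):
    """Partition tweets into (relevant, noise) lists, preserving order."""
    relevant, noise = [], []
    for text in tweets:
        (noise if is_noise(text, noise_terms) else relevant).append(text)
    return relevant, noise

# Made-up example tweets, for illustration only.
sample = [
    "Planted purple salvia in the garden today",
    "video of someone smoking salvia is all over my feed",
    "Salvia flowers attract hummingbirds",
]
relevant, noise = split_relevant(sample)
noise_share = len(noise) / len(sample)  # fraction of the sample flagged as noise
```

    Reporting `noise_share` alongside the cleaned counts gives a rough sense of how noisy a topic is before committing to a full analysis.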

  • RTI International

    3) Volume + Data Management

    o Limits on the amount of data that can be exported at one time, e.g., radian6 allows only 5,000 cases

    o Tweet files need to be merged for use with text analysis software, which also has limits on the volume of data it can import and analyze.

    Example: 17 months of healthcare reform Tweets -> 1.5 million Tweets -> 300 radian6 exports -> 26 CSV files -> 78 STAS files (~20k tweets per run)
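
    The merge-then-split workflow above can be sketched in a few lines. This is a minimal illustration, not the procedure used in the study: the column names, the helper names (`read_exports`, `chunk`), and the in-memory stand-in files are assumptions. The first line only checks the arithmetic on the slide: 1.5 million tweets at 5,000 per export comes to 300 exports.

```python
import csv
import io
import math

# Arithmetic from the slide: 1.5 million tweets at 5,000 cases per export.
exports_needed = math.ceil(1_500_000 / 5_000)  # 300

def read_exports(files):
    """Yield rows (as dicts) from several CSV exports that share a header."""
    for f in files:
        yield from csv.DictReader(f)

def chunk(rows, max_rows):
    """Group an iterable of rows into lists of at most max_rows each."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == max_rows:
            yield batch
            batch = []
    if batch:  # emit the final, possibly smaller, batch
        yield batch

# Tiny in-memory stand-ins for export files (column names are hypothetical).
export_1 = io.StringIO("id,text\n1,a\n2,b\n3,c\n")
export_2 = io.StringIO("id,text\n4,d\n5,e\n")

batches = list(chunk(read_exports([export_1, export_2]), max_rows=2))
batch_sizes = [len(b) for b in batches]  # [2, 2, 1]
```

    In practice `max_rows` would be set to whatever the downstream text analysis tool accepts (roughly 20,000 per run in the example above), and each batch would be written back out with `csv.DictWriter`.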

  • RTI International

    4) Metrics

    [Figures: three charts, one per metric]

    o # of salvia tweets (daily); chart title: Salvia Tweets, October 1 - December 31, 2010
    o % of tweeters mentioning “salvia” at least once (weekly)
    o Salvia tweets as % of all tweets (daily)

  • RTI International

    4) Metrics (cont)

    Unadjusted: total # of topic tweets per day

    Adjusted: topic tweets as % of all tweets per day
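
    The difference between the two metrics is easiest to see with numbers. The sketch below assumes daily counts are already available as dictionaries keyed by date; the function name `adjusted_share` and all counts are made up for illustration and are not data from the study.

```python
from datetime import date

def adjusted_share(topic_counts, total_counts):
    """Return {day: topic tweets / all tweets} for days with a known, nonzero total."""
    shares = {}
    for day, n_topic in topic_counts.items():
        n_total = total_counts.get(day)
        if n_total:  # skip days with missing or zero denominators
            shares[day] = n_topic / n_total
    return shares

# Made-up counts for illustration only.
topic = {date(2010, 10, 1): 500, date(2010, 10, 2): 1000}
total = {date(2010, 10, 1): 50_000_000, date(2010, 10, 2): 100_000_000}

shares = adjusted_share(topic, total)
# The raw count doubles from day 1 to day 2, but the share of all tweets
# is the same on both days (multiply by 100 to express it as a percent).
```

    Here the unadjusted metric suggests a doubling of interest, while the adjusted metric shows no change, because overall Twitter volume doubled too.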

  • RTI International

    5) Time Frame/ Unit of Analysis

    [Figure: % Salvia Tweets (day), May 1, 2008 - December 31, 2010. Y-axis: % Salvia Tweets, 0.000000 to 0.000100; x-axis: monthly ticks from 1-May-08 to 1-Dec-10.]

  • RTI International

    5) Time Frame/ Unit of Analysis (cont)

    [Figure, panel 1: % Salvia Tweets (week). Y-axis: 0.0000000 to 0.0000350; x-axis: weekly ticks from 3-Oct to 26-Dec.]

    [Figure, panel 2: % Salvia Tweets (day). Y-axis: 0.0000000 to 0.0001000; x-axis: weekly ticks from 1-Oct to 31-Dec.]
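
    Moving from a daily to a weekly unit of analysis is a simple aggregation, sketched below. Two assumptions are worth flagging: weeks are taken to begin on Sunday (October 3, 2010 was a Sunday, which would fit the 3-Oct tick label if ticks mark week starts, but the slides do not state the convention), and the weekly share is computed as summed topic tweets over summed total tweets rather than as a mean of daily shares. Function names and counts are hypothetical.

```python
from collections import defaultdict
from datetime import date, timedelta

def week_start(day):
    """Return the Sunday that begins the week containing `day`."""
    # date.weekday(): Monday=0 ... Sunday=6, so (weekday + 1) % 7 is days since Sunday.
    return day - timedelta(days=(day.weekday() + 1) % 7)

def weekly_share(topic_counts, total_counts):
    """Aggregate daily counts to weeks, then compute topic / total per week."""
    topic_w, total_w = defaultdict(int), defaultdict(int)
    for day, n_total in total_counts.items():
        wk = week_start(day)
        total_w[wk] += n_total
        topic_w[wk] += topic_counts.get(day, 0)  # days with no topic tweets count as 0
    return {wk: topic_w[wk] / total_w[wk] for wk in sorted(total_w) if total_w[wk]}

# Made-up counts for illustration only: a one-day spike on October 4.
topic = {date(2010, 10, 3): 10, date(2010, 10, 4): 90, date(2010, 10, 10): 20}
total = {date(2010, 10, 3): 1000, date(2010, 10, 4): 1000, date(2010, 10, 10): 1000}

weekly = weekly_share(topic, total)
```

    In this toy example the daily shares are 0.01 and 0.09 within the first week, while the weekly share is 0.05: the coarser unit smooths out the one-day spike, which is the trade-off the two panels illustrate.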

  • RTI International

    5) Time Frame/ Unit of Analysis (cont)

    Ketamine, May 1, 2008 - December 31, 2010

    [Figure: x-axis ticks from 5/1/08 to 1/1/11.]

  • RTI International

    Summary: Key Considerations

    o Topic suitable for Twitter analysis?
      – Enough conversation?
      – High noise potential?

    o Can you use a sample?

    o Which metric is most useful?
      – Raw volume? As a proportion of all tweets?

    o Are you trying to compare trends?
      – Timeframe of data sources
      – Unit of analysis

    o Do you have enough resources?
      – Potential cost of historical data
      – Data export, cleaning and analysis

  • RTI International

    Future Studies

    • Need for standards in sampling
      – Compare sample from API? Is it a random sample? Bias?

    • Need for standards in metrics
      – More frequent data from Twitter, e.g. daily Twitter volume for calculating the denominator, filtering out spam

    • Insights into general patterns of Twitter use and demographics of users

  • RTI International

    More Information

    Annice Kim, RTI International - [email protected]

    Heather Hansen, RTI International - [email protected]

    Joe Murphy, RTI International - [email protected]