The Art of Social Media Analysis with Twitter & Python-OSCON 2012
The Art of Social Media Analysis with Twitter & Python
krishna sankar @ksankar
http://www.oscon.com/oscon2012/public/schedule/detail/23130
Intro
API, Objects,…
Twitter Network Analysis Pipeline
@mention network
Growth, weak ties
Retweet analytics, Information contagion
Cliques, social graph
NLP, NLTK, Sentiment Analysis
#tag Network
House Rules (1 of 2)
o Doesn’t assume any knowledge of the Twitter API
o Goal: everybody on the same page with a working knowledge of the Twitter API
o To bootstrap your exploration into Social Network Analysis & Twitter
o Simple programs, to illustrate usage & data manipulation
We will analyze @clouderati, 2072 followers, exploding to ~980,000 distinct users down one level
House Rules (2 of 2)
o Am using the requests library
o There are good Twitter frameworks for Python, but I wanted to build from the basics. Once one understands the fundamentals, frameworks can help
o Many areas to explore – not enough time. So I decided to focus on the social graph, cliques & networkx
About Me
• Lead Engineer/Data Scientist/AWS Ops Guy at Genophen.com
o Co-chair – 2012 IEEE Precision Time Synchronization
• http://www.ispcs.org/2012/index.html
o Blog: http://doubleclix.wordpress.com/
o Quora: http://www.quora.com/Krishna-Sankar
• Prior Gigs
o Lead Architect (Egnyte)
o Distinguished Engineer (CSCO)
o Employee #64439 (CSCO) to #39 (Egnyte) & now #9!
• Current Focus:
o Design, build & ops of BioInformatics/Consumer Infrastructure on AWS, MongoDB, Solr, Drupal, GitHub, …
o Big Data (more of variety, variability, context & graphs, than volume or velocity – so far!)
o Overlay-based semantic search & ranking
• Other related Presentations
o http://goo.gl/P1rhc Big Data Engineering Top 10 Pragmatics (Summary)
o http://goo.gl/0SQDV The Art of Big Data (Detailed)
o http://goo.gl/EaUKH The Hitchhiker’s Guide to Kaggle, OSCON 2011 Tutorial
Twitter Tips – A Baker’s Dozen
1. Twitter APIs are (more or less) congruent & symmetric
2. Twitter is usually right & simple – recheck when you get unexpected results before blaming Twitter
o I was getting numbers when I was expecting screen_names in user objects
o Was ready to send blasting e-mails to the Twitter team. Decided to check one more time and found that my parameter key was wrong – screen_name instead of user_id
o Always test with one or two records before a long run! Learned the hard way
3. Twitter APIs are very powerful – consistent use can bear huge data
o In a week, you can pull in 4-5 million users & some tweets!
o Night runs are far faster & more error-free
4. Use a NOSQL data store as a command buffer & data buffer
o Would make it easy to work with Twitter at scale
o I use MongoDB
o Keep the schema simple & no fancy transformation
• And, as far as possible, the same as the (JSON) response
o Use the NOSQL CLI for trimming records et al
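Tip 4 can be sketched in a few lines. This is a minimal illustration, not the tutorial's actual code: a plain dict and list stand in for the MongoDB collections, and the ids and screen_names are made up.

```python
# Command buffer + data buffer sketch (tip 4). With pymongo you would
# swap the dict/list for MongoDB collections; the schema stays the
# same as the raw JSON response -- no fancy transformation.
data_buffer = {}      # raw Twitter user objects, keyed by id_str
command_buffer = []   # user ids still waiting to be crawled

def store_response(user_obj):
    """Store the response as-is, keeping the JSON schema."""
    data_buffer[user_obj["id_str"]] = user_obj

def enqueue(user_ids):
    command_buffer.extend(str(u) for u in user_ids)

def next_batch(n=100):
    """Pop up to n pending ids; whatever remains is the restart point."""
    batch, rest = command_buffer[:n], command_buffer[n:]
    command_buffer[:] = rest
    return batch

enqueue([15254297, 44614426])          # ids as a followers/ids call returns
for uid in next_batch():
    store_response({"id_str": uid, "screen_name": "user_" + uid})
print(len(data_buffer))  # 2
```

The same two structures reappear in the Part 2 walkthrough, backed by MongoDB.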
The End As The Beginning
Twitter Tips – A Baker’s Dozen
5. Always use a big data pipeline
o Collect - Store - Transform & Analyze - Model & Reason - Predict, Recommend & Visualize
o That way you can orthogonally extend, with functional components like command buffers, validation et al
6. Use a functional approach for a scalable pipeline
o Compose your big data pipeline with well-defined granular functions, each doing only one thing
o Don’t overload the functional components (i.e. no collect, unroll & store as a single component)
o Have well-defined functional components with appropriate caching, buffering, checkpoints & restart techniques
• This did create some trouble for me, as we will see later
7. Crawl-Store-Validate-Recrawl-Refresh cycle
o The equivalent of the traditional ETL
o Validation stage & validation routines are important
• Cannot expect perfect runs
• Cannot manually look at data either, when data is at scale
8. Have control numbers to validate runs & monitor them
o I still remember control numbers which start with the number of punch cards in the input deck & then follow that number through the various runs!
o There will be a separate printout of the control numbers that will be kept in the operations files
Twitter Tips – A Baker’s Dozen
9. Program defensively
o More so for REST-based Big Data analytics systems
o Expect failures at the transport layer & accommodate for them
10. Have Erlang-style supervisors in your pipeline
o Fail fast & move on
o Don’t linger and try to fix errors that cannot be controlled at that layer
o A higher-layer process will circle back and do incremental runs to correct missing spiders and crawls
o Be aware of visibility & lack of context. Validate at the lowest layer that has enough context to take corrective actions
o I have an example in part 2
11. Data will never be perfect
o Know your data & accommodate for its idiosyncrasies
• For example: 0 followers, protected users, 0 friends, …
Twitter Tips – A Baker’s Dozen
12. Checkpoint frequently (preferably after every API call) & have a re-startable command buffer cache
o See a MongoDB example in Part 2
13. Don’t bombard the URL
o Wait a few seconds between successive calls. This will end up as a scalable system, eventually
o I found 10 seconds to be the sweet spot. 5 seconds gave retry errors. Was able to work with 5 seconds with wait & retry. Then the rate limit started kicking in!
14. Always measure the elapsed time of your API runs & processing
o A kind of early warning when something is wrong
15. Develop incrementally; don’t fail to check “cut & paste” errors
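Tip 12 (checkpoint after every call, restartable runs) can be sketched like this. A temp file stands in for the MongoDB command buffer shown in Part 2, and the loop body is a placeholder for the real API call.

```python
# Checkpoint-after-every-call sketch (tip 12). A temp file records the
# crawl position so an interrupted run restarts where it left off.
import os
import tempfile

fd, CHECKPOINT = tempfile.mkstemp()
os.close(fd)

def load_checkpoint():
    with open(CHECKPOINT) as f:
        text = f.read().strip()
    return int(text) if text else 0

def save_checkpoint(i):
    with open(CHECKPOINT, "w") as f:
        f.write(str(i))

user_ids = list(range(10))              # stand-in for real user ids
for i in range(load_checkpoint(), len(user_ids)):
    # ... one Twitter API call for user_ids[i] would go here ...
    save_checkpoint(i + 1)              # checkpoint after every call

print(load_checkpoint())  # 10
```

Restarting the script re-enters the loop at the saved position instead of index 0.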
Twitter Tips – A Baker’s Dozen
16. The Twitter big data pipeline has lots of opportunities for parallelism
o Leverage data parallelism frameworks like MapReduce
o But first:
§ Prototype as a linear system,
§ Optimize and tweak the functional modules & cache strategies,
§ Note down stages and tasks that can be parallelized, and
§ Then parallelize them
o For the example project we will see later, I did not leverage any parallel frameworks, but the opportunities were clearly evident. I will point them out as we progress through the tutorial
17. Pay attention to handoffs between stages
o They might require transformation – for example, collect & store might store a user list as multiple arrays, while the model requires each user to be a document for aggregation
o But resist the urge to overload collect with transform
o i.e. let the collect stage store in arrays, but then have an unroll/flatten stage to transform the array into separate documents
o Add transformation as a granular function – of course, with appropriate buffering, caching, checkpoints & restart techniques
18. Have a good log management system to capture and wade through logs
Twitter Tips – A Baker’s Dozen
19. Understand the underlying network characteristics for the inference you want to make
o Twitter Network != Facebook Network, Twitter Graph != LinkedIn Graph
o The Twitter Network is more of an Interest Network
o So many of the traditional network mechanisms & mechanics, like network diameter & degrees of separation, might not make sense
o But others, like Cliques and Bipartite Graphs, do
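Cliques are covered with networkx in Part 2; the underlying idea fits in a few lines of plain Python. The follow graph below is invented for illustration: a clique is a set of users who all mutually follow each other.

```python
# A clique in the mutual-follow graph: every pair follows each other.
# Toy graph; with networkx you would build a Graph of mutual edges
# and call nx.find_cliques(G).
follows = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},          # d follows a, but a does not follow back
}

def mutual(u, v):
    return v in follows.get(u, set()) and u in follows.get(v, set())

def is_clique(users):
    users = list(users)
    return all(mutual(u, v)
               for i, u in enumerate(users)
               for v in users[i + 1:])

print(is_clique({"a", "b", "c"}))  # True
print(is_clique({"a", "b", "d"}))  # False
```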
Twitter Gripes
1. Need richer APIs for #tags
o Somewhat similar to users viz. followers, friends et al
o Might make sense to make #tags a top-level object with its own semantics
2. HTTP Error Return is not uniform
o Returns 400 Bad Request instead of 420
o Granted, there is enough information to figure this out
3. Need an easier way to get screen_name from user_id
4. “following” vs. “friends_count”, i.e. “following” is a dummy variable
o There are a few like this, most probably for backward compatibility
5. Parameter Validation is not uniform
o Gives “404 Not Found” instead of “406 Not Acceptable”, “413 Too Long” or “416 Range Unacceptable”
6. Overall, more validation would help
o Granted, it is more of growing pains. Once one comes across a few inconsistencies, the rest is easy to figure out
A Fork
• Not enough time for both
• NLP, NLTK & deep dive into Tweets
o Sentiment Analysis
• I chose the Social Graph route
A minute about Twitter as a platform & its evolution
My Wish & Hope
• I spend a lot of time with Twitter & derive value; the platform is rich & the APIs intuitive
• I did like the fact that tweets are part of LinkedIn. I still used Twitter more than LinkedIn
o I don’t think showing Tweets in LinkedIn took anything away from the Twitter experience
o The LinkedIn experience & the Twitter experience are different & distinct. Showing tweets in LinkedIn didn’t change that
• I sincerely hope that the platform grows with a rich developer ecosystem
• An orthogonally extensible platform is essential
• Of course, along with a congruent user experience – “… core Twitter consumption experience through consistent tools”
https://dev.twitter.com/blog/delivering-consistent-twitter-experience
“The micro-blogging service must find the right balance of running a profitable business and maintaining a robust developers' community.” – Chenda, CBS News
“… we want to make sure that the Twitter experience is straightforward and easy to understand -- whether you’re on Twitter.com or elsewhere on the web” – Michael
Setup
• For Hands-on Today
o Python 2.7.3
o easy_install -v requests
• http://docs.python-requests.org/en/latest/user/quickstart/#make-a-request
o easy_install -v requests-oauth
o Hands-on programs at https://github.com/xsankar/oscon2012-handson
• For advanced data science with social graphs
o easy_install -v networkx
o easy_install -v numpy
o easy_install -v nltk
• Not for this tutorial, but good for sentiment analysis et al
o MongoDB
• I used MongoDB in AWS m2.xlarge, RAID 10 X 8 X 15 GB EBS
o graphviz - http://www.graphviz.org/; easy_install pygraphviz
o easy_install pydot
Thanks To these Giants …
Problem Domain for this tutorial
• Data Science (trends, analytics et al) on Social Networks as observed by Twitter primitives
o Not for Twitter-based apps for real-time tweets
o Not web sites with real-time tweets
• By looking at the domain in aggregate to derive inferences & actionable recommendations
• Which also means you need to be deliberate & systemic (i.e. not look at a fluctuation as a trend, but dig deeper before pronouncing a trend)
Agenda
I. Mechanics: Twitter API (1:30 PM - 3:00 PM)
o Essential Fundamentals (Rate Limit, HTTP Codes et al)
o Objects
o API
o Hands-on (2:45 PM - 3:00 PM)
II. Break (3:00 PM - 3:30 PM)
III. Twitter Social Graph Analysis (3:30 PM - 5:00 PM)
o Underlying Concepts
o Social Graph Analysis of @clouderati
§ Stages, Strategies & Tasks
§ Code Walk-thru
Open This First
Twitter API: Read These First
• Using the Twitter Brand
o New logo & associated guidelines: https://twitter.com/about/logos
o Twitter Rules: https://support.twitter.com/groups/33-report-a-violation/topics/121-guidelines-best-practices/articles/18311-the-twitter-rules
o Developer Rules of the Road: https://dev.twitter.com/terms/api-terms
• Read These Links First
1. https://dev.twitter.com/docs/things-every-developer-should-know
2. https://dev.twitter.com/docs/faq
3. Field Guide to Objects: https://dev.twitter.com/docs/platform-objects
4. Security: https://dev.twitter.com/docs/security-best-practices
5. Media Best Practices: https://dev.twitter.com/media
6. Consolidated Page: https://dev.twitter.com/docs
7. Streaming APIs: https://dev.twitter.com/docs/streaming-apis
8. How to Appeal (not that you all would need it!): https://support.twitter.com/articles/72585
• Only one version of the Twitter APIs
API Status Page
• https://dev.twitter.com/status
• https://dev.twitter.com/issues
• https://dev.twitter.com/discussions
http://www.buzzfeed.com/tommywilhelm/google-users-being-total-dicks-about-the-twitter
Open This First
• Install pre-reqs as per the setup slide
• Run
o oscon2012_open_this_first.py
o To test connectivity – a “canary query”
• Run
o oscon2012_rate_limit_status.py
o Use http://www.epochconverter.com to check reset_time
• Formats: xml, json, atom & rss
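What oscon2012_rate_limit_status.py inspects can be sketched like this: read the x-ratelimit-* headers and convert the epoch reset time in code (time.gmtime does what epochconverter.com does by hand). The header values are copied from the rate-limit slides that follow.

```python
# Convert the x-ratelimit-reset epoch seconds into a readable UTC time.
import time

headers = {
    "x-ratelimit-limit": "150",
    "x-ratelimit-remaining": "147",
    "x-ratelimit-reset": "1341366831",
}

remaining = int(headers["x-ratelimit-remaining"])
reset_utc = time.strftime("%Y-%m-%d %H:%M:%S",
                          time.gmtime(int(headers["x-ratelimit-reset"])))
print(remaining, "calls left; window resets at", reset_utc, "UTC")
```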
Twitter API
[Diagram: the Twitter API family]
• Twitter REST – Core Data, Core Twitter Objects: build profile, create/post tweets, reply, favorite, re-tweet. Rate Limit: 150/350
• Twitter Search – Search & Trends: keywords, specific users, trends. Rate Limit: complexity & frequency
• Streaming – Near-realtime, high volume: Public Streams, User Streams, Site Streams; follow users, topics, data mining
• Firehose
Rate Limits
• By API type & Authentication Mode

API       | No authC                | authC  | Error
REST      | 150/hr                  | 350/hr | 400
Search    | Complexity & Frequency  | -N/A-  | 420
Streaming | Up to 1%                |        |
Firehose  | none                    | none   |
Rate Limit Header
{
  "status": "200 OK",
  "vary": "Accept-Encoding",
  "x-frame-options": "SAMEORIGIN",
  "x-mid": "8e775a9323c45f2a541eeb4d2d1eb9b468w81c6",
  "x-ratelimit-class": "api",
  "x-ratelimit-limit": "150",
  "x-ratelimit-remaining": "149",
  "x-ratelimit-reset": "1340467358",
  "x-runtime": "0.04144",
  "x-transaction": "2b49ac31cf8709af",
  "x-transaction-mask": "a6183ffa5f8ca943ff1b53b5644ef114df9d6bba"
}
Rate Limit-ed Header
{
  "cache-control": "no-cache, max-age=300",
  "content-encoding": "gzip",
  "content-length": "150",
  "content-type": "application/json; charset=utf-8",
  "date": "Wed, 04 Jul 2012 00:48:25 GMT",
  "expires": "Wed, 04 Jul 2012 00:53:25 GMT",
  "server": "tfe",
  …
  "status": "400 Bad Request",
  "vary": "Accept-Encoding",
  "x-ratelimit-class": "api",
  "x-ratelimit-limit": "150",
  "x-ratelimit-remaining": "0",
  "x-ratelimit-reset": "1341363230",
  "x-runtime": "0.01126"
}
Rate Limit Example
• Run
o oscon2012_rate_limit_02.py
• It iterates through a list to get followers
• The list is 2072 long
{
  …
  "date": "Wed, 04 Jul 2012 00:54:16 GMT",
  "status": "200 OK",
  "vary": "Accept-Encoding",
  "x-frame-options": "SAMEORIGIN",
  "x-mid": "f31c7278ef8b6e28571166d359132f152289c3b8",
  "x-ratelimit-class": "api",
  "x-ratelimit-limit": "150",
  "x-ratelimit-remaining": "147",
  "x-ratelimit-reset": "1341366831",
  "x-runtime": "0.02768",
  "x-transaction": "f1bafd60112dddeb",
  "x-transaction-mask": "a6183ffa5f8ca943ff1b53b5644ef11417281dbc"
}
Last time, it gave me 5 min. Now the reset timer is 1 hour: 150 calls, not authenticated
{
  "cache-control": "no-cache, max-age=300",
  "content-encoding": "gzip",
  "content-type": "application/json; charset=utf-8",
  "date": "Wed, 04 Jul 2012 00:55:04 GMT",
  …
  "status": "400 Bad Request",
  "transfer-encoding": "chunked",
  "vary": "Accept-Encoding",
  "x-ratelimit-class": "api",
  "x-ratelimit-limit": "150",
  "x-ratelimit-remaining": "0",
  "x-ratelimit-reset": "1341366831",
  "x-runtime": "0.01342"
}
And the Rate Limit kicked in
API with OAuth
{
  …
  "date": "Wed, 04 Jul 2012 01:32:01 GMT",
  "etag": "\"dd419c02ed00fc6b2a825cc27wbe040\"",
  "expires": "Tue, 31 Mar 1981 05:00:00 GMT",
  "last-modified": "Wed, 04 Jul 2012 01:32:01 GMT",
  "pragma": "no-cache",
  "server": "tfe",
  …
  "status": "200 OK",
  "vary": "Accept-Encoding",
  "x-access-level": "read",
  "x-frame-options": "SAMEORIGIN",
  "x-mid": "5bbb87c04fa43c43bc9d7482bc62633a1ece381c",
  "x-ratelimit-class": "api_identified",
  "x-ratelimit-limit": "350",
  "x-ratelimit-remaining": "349",
  "x-ratelimit-reset": "1341369121",
  "x-runtime": "0.05539",
  "x-transaction": "9f8508fe4c73a407",
  "x-transaction-mask": "a6183ffa5f8ca943ff1b53b5644ef11417281dbc"
}
OAuth: “api_identified”, 1 hr reset, 350 calls
{
  …
  "date": "Thu, 05 Jul 2012 14:56:05 GMT",
  …
  "x-ratelimit-class": "api_identified",
  "x-ratelimit-limit": "350",
  "x-ratelimit-remaining": "133",
  "x-ratelimit-reset": "1341500165",
  …
}
******** 2416
{
  …
  "date": "Thu, 05 Jul 2012 14:56:18 GMT",
  …
  "status": "200 OK",
  …
  "x-ratelimit-class": "api_identified",
  "x-ratelimit-limit": "350",
  "x-ratelimit-remaining": "349",
  "x-ratelimit-reset": "1341503776",
  …
}
******** 2417
The Rate Limit resets during consecutive calls (+1 hour)
Unexplained Errors
Traceback (most recent call last):
  File "oscon2012_get_user_info_01.py", line 39, in <module>
    r = client.get(url, params=payload)
  File "build/bdist.macosx-10.6-intel/egg/requests/sessions.py", line 244, in get
  File "build/bdist.macosx-10.6-intel/egg/requests/sessions.py", line 230, in request
  File "build/bdist.macosx-10.6-intel/egg/requests/models.py", line 609, in send
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='api.twitter.com', port=443): Max
retries exceeded with url: /1/users/lookup.json?user_id=237552390%2C101237516%2C208192270%2C340183853%2C221203257%2C15254297%2C44614426%2C617136931%2C415810340%2C76071717%2C17351462%2C574253%2C35048243%2C388547381%2C254329657%2C65585979%2C253580293%2C392741693%2C126403390%2C300467007%2C8962882%2C21545799%2C15254346%2C141083469%2C340312913%2C44614485%2C600359770%2C17351519%2C38323042%2C21545828%2C86557546%2C90751854%2C128500592%2C115917681%2C42517364%2C34128760%2C15254397%2C453559166%2C92849025%2C600359811%2C17351556%2C8962952%2C296038349%2C325503810%2C122209166%2C123827693%2C59294611%2C19448725%2C21545881%2C17351581%2C130468677%2C80266144%2C15254434%2C84680859%2C65586084%2C19448741%2C15254438%2C214483879%2C48808878%2C88654768%2C15474846%2C48808887%2C334021563%2C60214090%2C134792126%2C15254464%2C558416833%2C138986435%2C264815556%2C63488965%2C17222476%2C537445328%2C97854214%2C255598755%2C65586132%2C36226009%2C187220954%2C257346383%2C15254493%2C554222558%2C302564320%2C59165520%2C44614626%2C76071907%2C80266213%2C325503825%2C403227628%2C20368210%2C17351666%2C88654836%2C340313077%2C151569400%2C302564345%2C118014971%2C11060222%2C233229141%2C13727232%2C199803906%2C220435108%2C268531201
While trying to get details of 1,000,000 users, I get this error – usually 10-6 AM PST. Got around it by “trap & wait 5 seconds”. Night runs are relatively error-free
{
  …
  "date": "Fri, 06 Jul 2012 03:41:09 GMT",
  "expires": "Fri, 06 Jul 2012 03:46:09 GMT",
  "server": "tfe",
  "set-cookie": "dnt=; domain=.twitter.com; path=/; expires=Thu, 01-Jan-1970 00:00:00 GMT",
  "status": "400 Bad Request",
  "vary": "Accept-Encoding",
  "x-ratelimit-class": "api_identified",
  "x-ratelimit-limit": "350",
  "x-ratelimit-remaining": "0",
  "x-ratelimit-reset": "1341546334",
  "x-runtime": "0.01918"
}
Missed by 4 min!
Error, sleeping
{
  …
  "date": "Fri, 06 Jul 2012 03:46:12 GMT",
  …
  "status": "200 OK",
  …
  "x-ratelimit-class": "api_identified",
  "x-ratelimit-limit": "350",
  "x-ratelimit-remaining": "349",
  …
}
OK after 5 min sleep
A Day in the Life of the Twitter Rate Limit
Strategies
I have no exotic strategies, so far!
1. Obvious: track elapsed time & sleep when the rate limit kicks in
2. Combine authenticated & non-authenticated calls
3. Use multiple API types
4. Cache
5. Store & get only what is needed
6. Checkpoint & buffer request commands
7. Distributed data parallelism – for example, AWS instances
http://www.epochconverter.com/ <- useful to debug the timer
Please share your tips and tricks for conserving the Rate Limit
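Strategy 1 above can be sketched as a small throttle: when x-ratelimit-remaining hits zero, sleep until x-ratelimit-reset. The sleep is injectable (stubbed with a lambda here) so the sketch runs instantly; the header values are hypothetical.

```python
# Sleep-until-reset throttle (strategy 1).
import time

def seconds_until_reset(headers, now=None):
    now = time.time() if now is None else now
    return max(0, int(headers["x-ratelimit-reset"]) - int(now))

def throttle(headers, sleep=time.sleep, now=None):
    """Returns the number of seconds slept (0 if calls remain)."""
    if int(headers["x-ratelimit-remaining"]) == 0:
        wait = seconds_until_reset(headers, now) + 1   # small cushion
        sleep(wait)
        return wait
    return 0

# An exhausted window whose reset is 300 s away from "now"
hdrs = {"x-ratelimit-remaining": "0", "x-ratelimit-reset": "1341366831"}
print(throttle(hdrs, sleep=lambda s: None, now=1341366531))  # 301
```

In a real run you would call throttle(r.headers) after each requests call.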
Authentication
Authentication
• Three modes
o Anonymous
o HTTP Basic Auth
o OAuth
• As of Aug 31, 2010, only Anonymous or OAuth are supported
• OAuth enables the user to authorize an application without sharing credentials
• It also has the ability to revoke access
• Twitter supports OAuth 1.0a
• OAuth 2.0 is the new standard, much simpler
o No timeframe for Twitter support, yet
OAuth Pragmatics
• Helpful Links
o https://dev.twitter.com/docs/auth/oauth
o https://dev.twitter.com/docs/auth/moving-from-basic-auth-to-oauth
o https://dev.twitter.com/docs/auth/oauth/single-user-with-examples
o http://blog.andydenmark.com/2009/03/how-to-build-oauth-consumer.html
• Discussion of OAuth internal mechanisms is better left for another day
• For headless applications, to get an OAuth token go to https://dev.twitter.com/apps
• Create an application & get four credential pieces
o Consumer Key, Consumer Secret, Access Token & Access Token Secret
• All the frameworks have support for OAuth. So plug in these values & use the framework’s calls
• I used the requests-oauth library like so:
requests-oauth

def get_oauth_client():
    # Credentials from dev.twitter.com/apps (the values below are placeholders)
    consumer_key = "5dbf348aa966c5f7f07e8ce2ba5e7a3badc234bc"
    consumer_secret = "fceb3aedb960374e74f559caeabab3562efe97b4"
    access_token = "df919acd38722bc0bd553651c80674fab2b465086782Ls"
    access_token_secret = "1370adbe858f9d726a43211afea2b2d9928ed878"
    header_auth = True
    oauth_hook = OAuthHook(access_token, access_token_secret, consumer_key,
                           consumer_secret, header_auth)
    client = requests.session(hooks={'pre_request': oauth_hook})
    return client

def get_followers(user_id):
    url = 'https://api.twitter.com/1/followers/ids.json'
    payload = {"user_id": user_id}  # if a cursor is needed: {"cursor": -1, "user_id": scr_name}
    r = requests.get(url, params=payload)

def get_followers_with_oauth(user_id, client):
    url = 'https://api.twitter.com/1/followers/ids.json'
    payload = {"user_id": user_id}  # if a cursor is needed: {"cursor": -1, "user_id": scr_name}
    r = client.get(url, params=payload)

Use the client instead of requests
Get the client using the token, key & secret from dev.twitter.com/apps
Ref: http://pypi.python.org/pypi/requests-oauth
OAuth Authorize Screen
• The user authenticates with Twitter & grants access to Forbes Social
• Forbes Social doesn’t have the user’s credentials, but uses OAuth to access the user’s account
HTTP Status Codes
• 0 Never made it to Twitter servers - library error
• 200 OK
• 304 Not Modified
• 400 Bad Request
o Check the error message for an explanation
o REST Rate Limit!
• 401 Unauthorized
o Beware – you could get this for other reasons as well
• 403 Forbidden
o Hit update limit (> max Tweets/day, following too many people)
• 404 Not Found
• 406 Not Acceptable
• 413 Too Long
• 416 Range Unacceptable
• 420 Enhance Your Calm
o Rate Limited
• 500 Internal Server Error
• 502 Bad Gateway
o Down for maintenance
• 503 Service Unavailable
o Overloaded – the “Fail Whale”
• 504 Gateway Timeout
o Overloaded
https://dev.twitter.com/docs/error-codes-responses
HTTP Status Code - Example
{
  "cache-control": "no-cache, max-age=300",
  "content-encoding": "gzip",
  "content-length": "91",
  "content-type": "application/json; charset=utf-8",
  "date": "Sat, 23 Jun 2012 00:06:56 GMT",
  "expires": "Sat, 23 Jun 2012 00:11:56 GMT",
  "server": "tfe",
  …
  "status": "401 Unauthorized",
  "vary": "Accept-Encoding",
  "www-authenticate": "OAuth realm=\"https://api.twitter.com\"",
  "x-ratelimit-class": "api",
  "x-ratelimit-limit": "0",
  "x-ratelimit-remaining": "0",
  "x-ratelimit-reset": "1340413616",
  "x-runtime": "0.01997"
}
{
  "errors": [
    {
      "code": 53,
      "message": "Basic authentication is not supported"
    }
  ]
}
Detailed error message in JSON! I like this
HTTP Status Code – Confusing Example
{
  …
  "pragma": "no-cache",
  "server": "tfe",
  …
  "status": "404 Not Found",
  …
}
{
  "errors": [
    {
      "code": 34,
      "message": "Sorry, that page does not exist"
    }
  ]
}
• GET https://api.twitter.com/1/users/lookup.json?screen_nme=twitterapi,twitter&include_entities=true
• Spelling mistake
o Should be screen_name
• But a confusing error!
• Should be 406 Not Acceptable or 413 Too Long, showing a parameter error
HTTP Status Code - Example
{
  "cache-control": "no-cache, no-store, must-revalidate, pre-check=0, post-check=0",
  "content-encoding": "gzip",
  "content-length": "112",
  "content-type": "application/json;charset=utf-8",
  "date": "Sat, 23 Jun 2012 01:23:47 GMT",
  "expires": "Tue, 31 Mar 1981 05:00:00 GMT",
  …
  "status": "401 Unauthorized",
  "www-authenticate": "OAuth realm=\"https://api.twitter.com\"",
  "x-frame-options": "SAMEORIGIN",
  "x-ratelimit-class": "api",
  "x-ratelimit-limit": "150",
  "x-ratelimit-remaining": "147",
  "x-ratelimit-reset": "1340417742",
  "x-transaction": "d545a806f9c72b98"
}
{
  "error": "Not authorized",
  "request": "/1/statuses/user_timeline.json?user_id=12%2C15%2C20"
}
Sometimes the errors are not correct. I got this error for user_timeline.json w/ user_id=20,15,12 – clearly a parameter error (i.e. too many parameters)
Objects
Twitter Platform Objects
[Diagram: Users – with Friends (follow) & Followers (are followed by); Tweets – status updates, temporally ordered into a Timeline; Entities – hashtags (#), media, urls, user_mentions (@), embeds; Places]
https://dev.twitter.com/docs/platform-objects
Tweets
• A.k.a. Status Updates
• Interesting fields
o coordinates <- geo location
o created_at
o entities (will see later)
o id, id_str
o possibly_sensitive
o user (will see later)
• Perspectival attributes, embedded within a child object of an unlike parent – hard to maintain at scale
• https://dev.twitter.com/docs/faq#6981
o withheld_in_countries
• https://dev.twitter.com/blog/new-withheld-content-fields-api-responses
https://dev.twitter.com/docs/platform-objects/tweets
A word about id, id_str
• June 1, 2010
o Snowflake, the id generator service
o “The full ID is composed of a timestamp, a worker number, and a sequence number”
o JavaScript had problems handling numbers > 53 bits
o "id": 819797
o "id_str": "819797"
http://engineering.twitter.com/2010/06/announcing-snowflake.html
https://groups.google.com/forum/?fromgroups#!topic/twitter-development-talk/ahbvo3VTIYI
https://dev.twitter.com/docs/twitter-ids-json-and-snowflake
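The id/id_str pair is easy to demonstrate. Python's json module keeps full integer precision, so both fields agree; the point of id_str is for JavaScript consumers, where numbers above 53 bits get mangled. The sample id below is made up.

```python
# Why id_str exists: Snowflake ids exceed JavaScript's 53-bit safe
# integer range. Python parses the numeric id exactly, but portable
# code should read id_str.
import json

raw = '{"id": 243145735212777472, "id_str": "243145735212777472"}'
tweet = json.loads(raw)

print(tweet["id"] == int(tweet["id_str"]))  # True -- no precision loss here
print(tweet["id"] > 2 ** 53)                # past the JavaScript-safe range
```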
Tweets -‐‑ example • Let us run oscon2012-‐tweets.py • Example of tweet
o coordinates o id o id_str
Users
• followers_count
• geo_enabled
• id, id_str
• name, screen_name
• protected
• status, statuses_count
• withheld_in_countries
https://dev.twitter.com/docs/platform-objects/users
Users – Let us run some examples
• Run
o oscon_2012_users.py – lookup users by screen_name
o oscon12_first_20_ids.py – lookup users by user_id
• Inspect the results
o id, name, status, statuses_count, protected, followers (for top 10 followers), withheld users
• Can use the information for customizing the user’s screen in your web app
Entities
• Metadata & Contextual Information
• You can parse them yourself, but Entities parse them out as structured data
• REST API/Search API – include_entities=1
• Streaming API – included by default
• hashtags, media, urls, user_mentions
https://dev.twitter.com/docs/platform-objects/entities
https://dev.twitter.com/docs/tweet-entities
https://dev.twitter.com/docs/tco-url-wrapper
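Reading the pre-parsed entities looks like this. The tweet below is a hand-made sample in the shape the API returns (with include_entities=1), not a live response.

```python
# Pull hashtags, expanded urls & mentions out of the entities block.
tweet = {
    "text": "Slides for #oscon at http://t.co/example cc @ksankar",
    "entities": {
        "hashtags": [{"text": "oscon", "indices": [11, 17]}],
        "urls": [{"url": "http://t.co/example",
                  "expanded_url": "http://www.oscon.com/"}],
        "user_mentions": [{"screen_name": "ksankar"}],
    },
}

tags = [h["text"] for h in tweet["entities"]["hashtags"]]
links = [u["expanded_url"] for u in tweet["entities"]["urls"]]
mentions = [m["screen_name"] for m in tweet["entities"]["user_mentions"]]
print(tags, links, mentions)
```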
Entities • Run
o oscon2012_entities.py
• Inspect hashtags, urls et al
Places
• attributes
• bounding_box
• id (as a string!)
• country
• name
https://dev.twitter.com/docs/platform-objects/places
https://dev.twitter.com/docs/about-geo-place-attributes
Places
• Can search for tweets near a place, like so:
• Get the latlong of the convention center [45.52929, -122.66289]
o Tweets near that place
• Tweets near San Jose [37.395715, -122.102308]
• We will not go further here, but it is very useful
Timelines
• Collections of tweets ordered by time
• Use max_id & since_id for navigation
https://dev.twitter.com/docs/working-with-timelines
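The max_id navigation can be sketched without hitting the network: fake_timeline below stands in for a GET statuses/user_timeline call, and the loop walks backwards one page at a time until nothing older comes back.

```python
# Walk a timeline backwards with max_id (per the working-with-timelines
# doc): each page asks for tweets at or below max_id, then max_id is
# set just below the oldest id received.
all_ids = list(range(250, 0, -1))      # pretend tweet ids, newest first

def fake_timeline(max_id=None, count=100):
    ids = [i for i in all_ids if max_id is None or i <= max_id]
    return ids[:count]

collected, max_id = [], None
while True:
    page = fake_timeline(max_id=max_id)
    if not page:
        break
    collected.extend(page)
    max_id = page[-1] - 1              # strictly older than this page
print(len(collected))  # 250
```

The same loop shape works for since_id in the other direction, collecting only tweets newer than the last one seen.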
Other Objects & APIs • Lists • Notifications • Friendships/exists to see if one follows the other
Hands-on Exercise (15 min)
• Setup environment – slide #14
• Sanity-check environment & libraries
o oscon2012_open_this_first.py
o oscon2012_rate_limit_status.py
• Get objects (show calls)
o Lookup users by screen_name - oscon12_users.py
o Lookup users by id - oscon12_first_20_ids.py
o Lookup tweets - oscon12_tweets.py
o Get entities - oscon12_entities.py
• Inspect the results
• Explore a little bit
• Discussion
Twitter APIs
Twitter REST API
• https://dev.twitter.com/docs/api
• What we have been doing is the REST API
• Request-Response
• Anonymous or OAuth
• Rate Limited:
o 150/350
Twitter Trends
• oscon2012-trends.py
• Trends/weekly, Trends/monthly
• Let us run some examples
o oscon2012_trends_daily.py
o oscon2012_trends_weekly.py
• Trends & hashtags
o #hashtag euro2012
o http://hashtags.org/euro2012
o http://sproutsocial.com/insights/2011/08/twitter-hashtags/
o http://blog.twitter.com/2012/06/euro-2012-follow-all-action-on-pitch.html
o Top 10: http://twittercounter.com/pages/100, http://twitaholic.com/
Brand Rank w/ Twitter
• Walk-through & results of the following
o oscon2012_brand_01.py
• Followed 10 user-brands for a few days to find growth
• Brand Rank
o Growth of a brand w.r.t. the industry
o A surge in popularity could be due to -ve or +ve buzz. Need to understand & correlate using Twitter APIs & metrics
• API: url='https://api.twitter.com/1/users/lookup.json'
• payload={"screen_name":"miamiheat,okcthunder,nba,uefacom,lovelaliga,FOXSoccer,oscon,clouderati,googleio,OReillyMedia"}
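The comparison uses percentages rather than absolute follower counts. A sketch of that normalization, with invented counts rather than real data from the runs:

```python
# Percentage growth of followers_count over a run of daily samples.
# The numbers are illustrative, not real data from the tutorial.
counts = {
    "oscon":      [11000, 11150, 11320],
    "clouderati": [2072, 2074, 2075],
}

def pct_growth(series):
    """Growth of the last sample over the first, as a percentage."""
    return round(100.0 * (series[-1] - series[0]) / series[0], 2)

for brand in sorted(counts):
    print(brand, pct_growth(counts[brand]))
```

Brands with very different absolute audiences become comparable this way, which is why the slides talk about whether brands “track” each other.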
Brand Rank w/ Twitter – Tech Brands
• Clouderati is very stable
• Google I/O showed a spike on 6/27-6/28
• OReillyMedia shares some spike
• Looking at a few days’ worth of data, our best inference is that “oscon doesn’t track with googleio”
• “Clouderati doesn’t track at all”
Brand Rank w/ Twitter – World of Soccer
• FOXSoccer & UEFAcom track each other
• The numbers seldom decrease, so calculating -ve velocity will not work. OTOH, if you see a -ve velocity, investigate
Brand Rank w/ Twitter – World of Basketball
• NBA, MiamiHeat & okcthunder track each other
• Used % rather than absolute numbers to compare
• The hike from 7/6 to 7/10 is interesting
Brand Rank w/ Twitter – Rising Tide …
• For some reason, all numbers are going up 7/6 thru 7/10 – except for clouderati!
• Is a rising (Twitter) tide lifting all (well, almost all)?
Trivia: Search API
• Search (search.twitter.com)
o Built by Summize, which was acquired by Twitter in 2008
o Summize described itself as “sentiment mining”
Search API
• Very simple
o GET http://search.twitter.com/search.json?q=<blah>
• Based on a search criterion
• “The Twitter Search API is a dedicated API for running searches against the real-time index of recent Tweets”
• Recent = the last 6-9 days’ worth of tweets
• Anonymous call
• Rate Limit
o Not no. of calls/hour, but complexity & frequency
https://dev.twitter.com/docs/using-search
https://dev.twitter.com/docs/api/1/get/search
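Building a Search call against the endpoint above is mostly URL-encoding, which urlencode takes care of (it produces the %23/%3A style escapes). The query string and the rpp (results per page) parameter here are illustrative.

```python
# Compose a v1 Search API URL; urlencode handles the percent-escaping,
# so @ and # can be written literally in the query.
try:
    from urllib.parse import urlencode   # Python 3
except ImportError:
    from urllib import urlencode         # Python 2, as used in the tutorial

base = "http://search.twitter.com/search.json"
url = base + "?" + urlencode({"q": "#oscon :)", "rpp": 10})
print(url)
```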
Search API
• Filters
o Search is URL-encoded
o @ = %40, # = %23
o Emoticons :) and :(
o http://search.twitter.com/search.atom?q=sometimes+%3A)
o http://search.twitter.com/search.atom?q=sometimes+%3A(
• Location filters, date filters
• Content searches
Streaming API
• Not request-response, but a stream
• Twitter frameworks have the support
• Rate Limit: up to 1%
• Stall warning if the client is falling behind
• Good documentation links
o https://dev.twitter.com/docs/streaming-apis/connecting
o https://dev.twitter.com/docs/streaming-apis/parameters
o https://dev.twitter.com/docs/streaming-apis/processing
Firehose • ~ 400 million public tweets/day • If you are working with Twitter firehose, I envy you !
• If you hit real limits, then explore the firehose route • AFAIK, it is not cheap, but worth it
API Best Practices 1. Use JSON 2. Use user_id than screen_name
o User_id is constant while screen_name can change
3. max_id and since_id o For example direct messages, if you have last message use
since_id for search o max_id how far to go back
4. Cache as much as you can 5. Set the User-Agent header for debugging. I have listed a few good blogs that have API best practices in the reference section, at the end of this presentation
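The since_id/max_id pattern above can be sketched as a loop that walks backwards through pages. This is an illustration with an injected `fetch_page` function standing in for the actual API call (the helper names are hypothetical, not Twitter's):

```python
def fetch_new(fetch_page, since_id):
    """Collect everything newer than `since_id`, paging backwards with max_id.

    `fetch_page(since_id, max_id)` stands in for an API call; it must
    return a list of items (dicts with an "id"), newest first, or [] when done.
    """
    items, max_id = [], None
    while True:
        page = fetch_page(since_id, max_id)
        if not page:
            return items
        items.extend(page)
        # Next page: everything strictly older than the oldest id seen
        max_id = page[-1]["id"] - 1

def fake_page(since_id, max_id):
    """Stand-in data source: ids 10..5 exist, page size is 3."""
    data = [{"id": i} for i in range(10, since_id, -1)]
    if max_id is not None:
        data = [d for d in data if d["id"] <= max_id]
    return data[:3]

new_items = fetch_new(fake_page, since_id=4)  # ids 10 down to 5, two pages
```

since_id bounds the search at the newest item you already have; max_id drives the backwards pagination.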
These are gathered from various books, blogs & other media I used for this tutorial. See Reference (at the end) for the sources
Twitter API
REST Streaming
Twitter REST
Core Data, Core Twitter Objects
Near-realtime, High Volume
Twitter Search
Search & Trends
Keywords Specific User Trends
Build Profile Create/Post Tweets Reply Favorite, Re-tweet
Public Streams User Streams Site Streams
Follow users, topics, data mining
Rate Limit : 150/350 Rate Limit : Complexity & Frequency
Firehose
Questions ?
Part II
SNA
Part II Twitter Network Analysis
1. Collect 3. Transform & Analyze
2. Store
4. Model &
Reason 5. Predict,
Recommend & Visualize
Validate Dataset & re-crawl/refresh
Tip: 1. Implement a
s
a staged p
ipeline,
never a monolit
h�
Tip: 3. Keep t
he
schema simple; don’t
be afraid to
transform�
Most important & the ugliest slide in
this deck !
Trivia • Social Network Analysis originated as Sociometry &
the social network was called a sociogram • Back then, Facebook was called SocioBinder! • Jacob Levi Moreno is considered the originator
o NYTimes, April 3, 1933, P. 17
Twitter Networks - Definitions • Nodes
o Users o #tags
• Edges o Follows o Friends o @mentions o #tags
• Directed
Twitter Networks - Definitions • In-degree
o Followers
• Out-Degree o Friends/Follow
• Centrality Measures • Hubs & Authorities
o Hubs/Directories tell us where Authorities are
o “Of Mortals & Celebrities” is more “Twitter-style”
Twitter Networks - Properties • Concepts From Citation Networks
o Cocitation • Common papers that cite a paper • Common Followers
o C & G (Followed by F & H) o Bibliographic Coupling
• Cite the same papers • Common Friends (i.e. follow same person)
o D, E, F & H follow C o H & F follow C & G
• So H & F have high coupling • Hence, if H follows A, we can
recommend F to follow A
[Figure: example follower network with nodes A-N, illustrating co-citation (common followers) & bibliographic coupling (common friends)]
Twitter Networks - Properties • Bipartite/Affiliation Networks
o Two disjoint subsets o The bipartite concept is very relevant to Twitter social graph o Membership in Lists
• lists vs. users bipartite graph o Common #Tags in Tweets
• #tags vs. members bipartite graph o @mention together
• ? Can this be a bipartite graph • ? How would we fold this ?
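"Folding" a bipartite graph means projecting it onto one of its two node sets. A tiny pure-Python sketch of the lists-vs-users case (the list names and users are made up; networkx's `bipartite.projected_graph` does this at scale):

```python
from itertools import combinations
from collections import Counter

# A bipartite graph as a dict: list name -> set of member users (hypothetical data)
memberships = {
    "cloud":   {"alice", "bob", "carol"},
    "bigdata": {"bob", "carol", "dave"},
}

def fold_onto_users(memberships):
    """One-mode projection: user-user edge weight = number of lists shared."""
    weights = Counter()
    for users in memberships.values():
        for u, v in combinations(sorted(users), 2):
            weights[(u, v)] += 1
    return weights

proj = fold_onto_users(memberships)  # bob & carol share 2 lists, strongest tie
```

The same fold works for #tags vs. users: two users who tweet the same #tags end up with a weighted edge.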
Other Metrics & Mechanisms • Kronecker Graph Models
o Kronecker product is a way of generating self-similar matrices o Prof. Leskovec et al define the Kronecker product of two graphs as the Kronecker product of
their adjacency matrices o Application : Generating models for analysis, prediction, anomaly detection et al
• Erdős-Rényi Random Graphs o Easy to build a Gn,p graph o Assumes equal likelihood of edges between two nodes o In a Twitter social network, we can create a more realistic expected distribution (adding the
“social reality” dimension) by inspecting the #tags & @mentions • Network Diameter • Weak Ties • Follower velocity (+ve & -ve), Association strength
o Unfollow not a reliable measure o But an interesting property to investigate when it happens
Not covered here, but potential for an encore ! Ref: Jure Leskovec: Kronecker Graphs, Random Graphs
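The Kronecker product of adjacency matrices mentioned above is mechanical to compute. A minimal pure-Python sketch (real Kronecker graph work would use numpy's `kron` and a stochastic initiator matrix):

```python
def kron(a, b):
    """Kronecker product of two square adjacency matrices (lists of lists)."""
    n, m = len(a), len(b)
    return [[a[i // m][j // m] * b[i % m][j % m]
             for j in range(n * m)] for i in range(n * m)]

# 2-node "initiator" graph: one self-loop plus an edge
g = [[1, 1],
     [1, 0]]

g2 = kron(g, g)  # 4x4 self-similar adjacency matrix; iterate for larger graphs
```

Repeatedly Kronecker-multiplying the initiator by itself yields the self-similar structure Leskovec et al use for modeling and anomaly detection.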
Twitter Networks - Properties • Twitter != LinkedIn, Twitter != Facebook • Twitter Network == Interest Network • Be cognizant of the above when you apply traditional network
properties to Twitter • For example,
o Six degrees of separation doesn't make sense (most of the time) in Twitter - except maybe for Cliques
o Is diameter a reliable measure for a Twitter Network ? • Probably not
o Do cut sets make sense ? • Probably not
o But citation network principles do apply; we can learn from cliques o Bipartite graphs do make sense
Cliques (1 of 2) • “Maximal subset of the vertices in an
undirected network such that every member of the set is connected by an edge to every other”
• Cohesive subgroup, closely connected • Near-cliques rather than a perfect clique (k-plex, i.e.
each member connected to at least n-k others) • k-plex cliques to discover subgroups in a sparse
network; 1-plex being the perfect clique
Ref: Networks, An Introduction - Newman
Cliques (2 of 2) • k-core - at least k others in the subset; an (n-k)-plex
• k-clique - no more than k distance away o Path inside or outside the subset o k-clan or k-club (path inside the subset)
• We will apply k-plex cliques for one of our hands-on
Ref: Networks, An Introduction - Newman
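For the hands-on we use networkx's `find_cliques`, which enumerates maximal cliques. To show the idea, here is a tiny pure-Python Bron-Kerbosch (the algorithm behind it), on a made-up 4-node graph; it is a sketch, not the tutorial's actual code:

```python
def maximal_cliques(adj):
    """Bron-Kerbosch: yield every maximal clique of an undirected graph.
    `adj` maps each node to the set of its neighbours."""
    def bk(r, p, x):
        if not p and not x:
            yield sorted(r)          # r is maximal: nothing left to add
        for v in list(p):
            yield from bk(r | {v}, p & adj[v], x & adj[v])
            p.remove(v)
            x.add(v)
    yield from bk(set(), set(adj), set())

# Hypothetical graph: triangle a-b-c plus a pendant edge c-d
adj = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
cliques = sorted(maximal_cliques(adj))  # [['a','b','c'], ['c','d']]
```

At @clouderati scale you would reach for `networkx.find_cliques` instead; the output has the same shape.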
Sentiment Analysis • Sentiment Analysis is an important & interesting area of work
on the Twitter platform o Collect Tweets o Opinion Estimation - pass through classifiers, sentiment lexicons
• Naïve Bayes/Max Entropy Classifier/SVM
o Aggregated Text Sentiment/Moving Average
• I chose not to dive deeper because of time constraints o Couldn’t do justice to API, Social Network and Sentiment Analysis,
all in 3 hrs
• The next 3 slides have a couple of interesting examples
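The opinion-lexicon approach from the airline example can be sketched in a few lines. The word lists here are tiny stand-ins (the real lexicon referenced later has ~2,000 positive and ~4,800 negative words):

```python
# Tiny stand-in lexicons for illustration only
POSITIVE = {"great", "love", "awesome", "good"}
NEGATIVE = {"delayed", "lost", "terrible", "bad"}

def score(tweet: str) -> int:
    """Lexicon score: +1 per positive word, -1 per negative word."""
    words = tweet.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

print(score("Love the awesome service"))      # 2
print(score("bags lost and flight delayed"))  # -2
```

A classifier (Naïve Bayes, MaxEnt, SVM) replaces this counting with learned weights, but the collect-score-aggregate pipeline stays the same.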
Sentiment Analysis • Twitter Mining for Airline Sentiment • Opinion Lexicon - +ve 2,000, -ve 4,800 words
http://www.inside-r.org/howto/mining-twitter-airline-consumer-sentiment http://sentiment.christopherpotts.net/lexicons.html#opinionlexicon
Need I say more ?
http://www.economist.com/blogs/schumpeter/2012/06/tracking-social-media?fsrc=scn/gp/wl/bl/moodofthemarket http://www.relevantdata.com/pdfs/IUStudy.pdf
“A bit of clever math can uncover interesting patterns that are not visible to the human eye”
Project Ideas
Interesting Vectors of Exploration 1. Find trending #tags & then related #tags - using
cliques over co-#tag-citation, which infers topics related to trending topics
2. Related #tag topics over a set of tweets by a user or group of users
3. Analysis - In/Out flow, Tweet Flow - Frequent @mention
4. Find affiliation networks by List memberships, #tags or frequent @mentions
Interesting Vectors of Exploration 5. Use centrality measures to determine mortals vs.
celebrities 6. Classify Tweet networks/cliques based on message
passing characteristics - Tweets vs. Retweets, No. of retweets,…
7. Retweet Network – Measure Influence by retweet count & frequency – Information contagion by looking at different retweet
network subcomponents – who, when, how much,…
Twitter Network Graph Analysis
An Example
Analysis Story Board • @clouderati is a popular cloud related Twitter account
• Goals: o Analyze the social graph characteristics of the users who are
following the account • Dig one level deep, to the followers & friends, of the
followers of @clouderati o How many cliques ? How strong are they ? o Does the @mention support the clique inferences ? o What are the retweet characteristics ? o How does the #tag network graph look like ?
In this tutorial
For you to explore !!
Twitter Analysis Pipeline Story Board Stages, Strategies, APIs & Tasks
Stage 3
o Get distinct user list applying the set(union(list)) operation
Stage 4
o Get & Store User details (distinct user list)
o Unroll
Stage 5 o For each @clouderati
follower o Find friend=follower - set
intersection
Stage 6
o Create social graph
o Apply network theory
o Infer cliques & other properties
Note: Needed a command buffer to manage scale (~980,000 users)
Note: Unroll stage took time & missteps
@clouderati Twitter Social Graph • Stats (Retrospect after the runs):
o Stage 1 • @clouderati has 2072 followers
o Stage 2 • Limiting followers to 5,000 per user
o Stage 3 • Digging 1st level (set union of followers & friends of the
followers of @clouderati) explodes into ~980,000 distinct users
o MongoDB cache and intermediate datasets ~10 GB o The database was hosted at AWS (High-Memory Extra Large - m2.xlarge), 8
x 15 GB, RAID 10, opened to the Internet with DB authentication
Code & Run Walk Through o Code:
§ oscon_2012_user_list_spider_01.py
o Challenges: § Nothing fancy § Get the record and store § Would have had to recurse through a REST
cursor if there were more than 5000 followers § @clouderati has 2072 followers
o Interesting Points:
Stage 1
o Get @clouderati Followers o Store in MongoDB
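Stage 1 boils down to walking Twitter's cursor until it returns 0. A sketch with an injected `fetch_page` standing in for the historical GET followers/ids call (the fake two-page data below is for illustration; storing to MongoDB is shown only as a comment):

```python
def crawl_followers(fetch_page):
    """Walk the followers/ids cursor until the API returns next_cursor == 0.

    `fetch_page(cursor)` stands in for
    GET https://api.twitter.com/1/followers/ids.json?cursor=...
    and must return a dict with "ids" and "next_cursor".
    """
    ids, cursor = [], -1          # -1 asks for the first page
    while cursor != 0:
        page = fetch_page(cursor)
        ids.extend(page["ids"])
        cursor = page["next_cursor"]
    return ids

# Fake two-page response for illustration
pages = {-1: {"ids": [1, 2, 3], "next_cursor": 99},
         99: {"ids": [4, 5], "next_cursor": 0}}
followers = crawl_followers(lambda c: pages[c])
# With pymongo, e.g.: db.t_followers.insert({"ids": followers})
```

@clouderati's 2072 followers fit in a single page, so the loop runs once; the cursor logic matters past 5,000 ids.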
Code & Run Walk Through o Code:
§ oscon_2012_user_list_spider_02.py § oscon_2012_twitter_utils.py § oscon_2012_mongo.py § oscon_2012_validate_dataset.py
o Challenges: § Multiple runs, errors et al !
o Interesting Points: § Set operation between two mongo collections for restart buffer § Protected users, some had 0 followers, or 0 friends § Interesting operations for validate, re-‐crawl and refresh § Added “status_code” to differentiate protected users
§ {'$set': {'status_code': '401 Unauthorized,401 Unauthorized'}} § Getting friends & followers of 2000 users is the hardest (or so I thought,
until I got through the next stage!)
Stage 2
o Crawl 1 level deep o Get friends & followers o Validate, re-‐crawl & refresh
Validate-Recrawl-Refresh Logs • pymongo version = 2.2 • Connected to DB! • … • 2075 • Error Friends : <type 'exceptions.KeyError'> • 4ff3cd40e5557c00c7000000 - none has 2072 followers & 0 friends • Error Friends : <type 'exceptions.KeyError'> • 4ff3a958e5557cfc58000000 - none has 2072 followers & 0 friends • Error Friends : <type 'exceptions.KeyError'> • 4ff3ccdee5557c00b6000000 - none has 2072 followers & 0 friends • 4ff3d3b9e5557c01b900001e - 371187804 has 0 followers & 0 friends • 4ff3d3d8e5557c01b9000048 - 63488295 has 155 followers & 0 friends • 4ff3d3d9e5557c01b9000049 - 342712617 has 0 followers & 0 friends • 4ff3d3d9e5557c01b900004a - 21266738 has 0 followers & 0 friends • 4ff3d3dae5557c01b900004b - 204652853 has 0 followers & 0 friends • … • 4ff475cfe5557c1657000074 - 258944989 has 0 followers & 0 friends • 4ff475d3e5557c165700007d - 327286780 has 0 followers & 0 friends • Looks like we have 132 not so good records • Elapsed Time = 0.546846
o 1st run - 132 bad records o This is the classic Erlang-style
supervisor o The crawl continues on transport errors
without worrying about retry o Validate will recrawl & refresh as
needed
Code & Run Walk Through o Code:
§ oscon2012_analytics_01.py
o Challenges: o Figure out the right Set operations
o Interesting Points: § 973,323 unique users ! § Recursively apply set union over 400,00 lists § Set operations took slightly more than a minute
Stage 3
o Get distinct user list applying the set(union(list)) operation
Code & Run Walk Through o Code:
§ oscon2012_analytics_01.py (focus on cmd string creation) § oscon2012_get_user_info_01.py § oscon2012_unroll_user_list_01.py § oscon2012_unroll_user_list_02.py
o Challenges: § Where do I start ?
• In the next few slides § Took me a few days to get it right (along with my daily job!) § Unfortunately I did not employ parallelism & didn’t use my
MacPro with 32 GB memory. So the runs were long § But learned hard lessons on check point & restart
o Interesting Points: § Tracking Control Numbers § Time … Marathon unroll run 19:33:33 !
Stage 4
o Get & Store User details (distinct user list)
o Unroll
Twitter @ scale Pattern • Challenge:
o You want to get screen names, follower counts and other details for a million users
• Problem: o No easy REST API o https://api.twitter.com/1/users/lookup.json will take 100 user_ids and give
details
• Solution: o This is a scalability challenge. Approach it like so o Create a command buffer collection in MongoDB splitting the million user_ids
into batches of 100 o Have a “done” flag initialized to 0 for checkpoint & restart o After each cmd str is executed, set “done”:1 o For subsequent runs, ignore “done”:1 o Also helps in control number tracking
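The command-buffer pattern above can be sketched as follows. The batching function is real, runnable Python; the pymongo calls and the `api_str`/`seq_no`/`done` field names follow the slides, but the helper name is mine:

```python
def build_cmd_buffer(user_ids, batch=100):
    """Split user_ids into lookup batches, each with a 'done' flag for restart."""
    docs = []
    for seq_no, i in enumerate(range(0, len(user_ids), batch)):
        chunk = user_ids[i:i + batch]
        docs.append({"seq_no": seq_no,
                     "api_str": ",".join(str(u) for u in chunk),
                     "done": 0})
    return docs

docs = build_cmd_buffer(list(range(250)))  # 3 batches: 100 + 100 + 50
# With pymongo: db.api_str.insert(docs); after each successful lookup call,
#   db.api_str.update({"seq_no": d["seq_no"]}, {"$set": {"done": 1}})
# Subsequent runs only fetch {"done": 0} -- checkpoint & restart for free.
```

The `done` flag is what makes the multi-day unroll run survivable: a crash costs at most one batch.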
Control numbers
Control Numbers • > db.t_users_info.count() • 8122 • > db.api_str.count({"done":0,"seq_no":{"$lt":8185}},{"seq_no”:) • 63 • > db.api_str.find({"done":0,"seq_no":{"$lt":8185}},{"seq_no":1}) • { "_id" : ObjectId("4ff4daeae5557c28bf001d53"), "seq_no" : 5433 } • { "_id" : ObjectId("4ff4daeae5557c28bf001d59"), "seq_no" : 5439 } • { "_id" : ObjectId("4ff4daeae5557c28bf001d5f"), "seq_no" : 5445 } • { "_id" : ObjectId("4ff4daebe5557c28bf001d74"), "seq_no" : 5466 } • { "_id" : ObjectId("4ff4daece5557c28bf001d7a"), "seq_no" : 5472 } • { "_id" : ObjectId("4ff4daece5557c28bf001d80"), "seq_no" : 5478 } • { "_id" : ObjectId("4ff4daede5557c28bf001d90"), "seq_no" : 5494 } • { "_id" : ObjectId("4ff4daefe5557c28bf001daf"), "seq_no" : 5525 } • { "_id" : ObjectId("4ff4daf0e5557c28bf001dba"), "seq_no" : 5536 } • { "_id" : ObjectId("4ff4daf1e5557c28bf001dcf"), "seq_no" : 5557 } • { "_id" : ObjectId("4ff4daf2e5557c28bf001de9"), "seq_no" : 5583 } • { "_id" : ObjectId("4ff4daf2e5557c28bf001def"), "seq_no" : 5589 } • { "_id" : ObjectId("4ff4daf4e5557c28bf001e0e"), "seq_no" : 5620 } • { "_id" : ObjectId("4ff4daf4e5557c28bf001e14"), "seq_no" : 5626 } • { "_id" : ObjectId("4ff4daf6e5557c28bf001e2e"), "seq_no" : 5652 } • { "_id" : ObjectId("4ff4daf6e5557c28bf001e39"), "seq_no" : 5663 } • { "_id" : ObjectId("4ff4daf8e5557c28bf001e62"), "seq_no" : 5704 } • { "_id" : ObjectId("4ff4dafae5557c28bf001e77"), "seq_no" : 5725 } • { "_id" : ObjectId("4ff4dafae5557c28bf001e81"), "seq_no" : 5735 } • { "_id" : ObjectId("4ff4dawe5557c28bf001e9b"), "seq_no" : 5761 } • Type "it" for more • > it • { "_id" : ObjectId("4ff4dafce5557c28bf001ea6"), "seq_no" : 5772 } • { "_id" : ObjectId("4ff4dafce5557c28bf001eac"), "seq_no" : 5778 } • { "_id" : ObjectId("4ff4dafde5557c28bf001eb7"), "seq_no" : 5789 } • { "_id" : ObjectId("4ff4dafde5557c28bf001ebd"), "seq_no" : 5795 } • { "_id" : ObjectId("4ff4dafee5557c28bf001ec8"), "seq_no" : 5806 } • { "_id" : ObjectId("4ff4daffe5557c28bf001ed8"), "seq_no" : 5822 } • { "_id" : 
ObjectId("4ff4db00e5557c28bf001eed"), "seq_no" : 5843 } • { "_id" : ObjectId("4ff4db00e5557c28bf001ef3"), "seq_no" : 5849 } • { "_id" : ObjectId("4ff4db01e5557c28bf001efe"), "seq_no" : 5860 } • { "_id" : ObjectId("4ff4db01e5557c28bf001f09"), "seq_no" : 5871 } • { "_id" : ObjectId("4ff4db03e5557c28bf001f23"), "seq_no" : 5897 } • { "_id" : ObjectId("4ff4db05e5557c28bf001f47"), "seq_no" : 5933 } • { "_id" : ObjectId("4ff4db05e5557c28bf001f52"), "seq_no" : 5944 } • { "_id" : ObjectId("4ff4db06e5557c28bf001f58"), "seq_no" : 5950 } • { "_id" : ObjectId("4ff4db06e5557c28bf001f5e"), "seq_no" : 5956 } • { "_id" : ObjectId("4ff4db06e5557c28bf001f69"), "seq_no" : 5967 } • { "_id" : ObjectId("4ff4db07e5557c28bf001f74"), "seq_no" : 5978 } • { "_id" : ObjectId("4ff4db07e5557c28bf001f7f"), "seq_no" : 5989 } • { "_id" : ObjectId("4ff4db0ae5557c28bf001fa8"), "seq_no" : 6030 } • { "_id" : ObjectId("4ff4db0ae5557c28bf001fae"), "seq_no" : 6036 } • Type "it" for more • > it • { "_id" : ObjectId("4ff4db0ae5557c28bf001w9"), "seq_no" : 6047 } • { "_id" : ObjectId("4ff4db0be5557c28bf001fc4"), "seq_no" : 6058 } • { "_id" : ObjectId("4ff4db0be5557c28bf001fca"), "seq_no" : 6064 } • { "_id" : ObjectId("4ff4db0de5557c28bf001fe0"), "seq_no" : 6086 } • { "_id" : ObjectId("4ff4db0de5557c28bf001fe6"), "seq_no" : 6092 } • { "_id" : ObjectId("4ff4db0de5557c28bf001fec"), "seq_no" : 6098 } • { "_id" : ObjectId("4ff4db0ee5557c28bf002006"), "seq_no" : 6124 } • { "_id" : ObjectId("4ff4db10e5557c28bf002025"), "seq_no" : 6155 } • { "_id" : ObjectId("4ff4db12e5557c28bf002044"), "seq_no" : 6186 } • { "_id" : ObjectId("4ff4db12e5557c28bf00204a"), "seq_no" : 6192 } • { "_id" : ObjectId("4ff4db1ae5557c28bf0020e0"), "seq_no" : 6342 } • { "_id" : ObjectId("4ff4db1ae5557c28bf0020e1"), "seq_no" : 6343 } • { "_id" : ObjectId("4ff4db2ee5557c28bf002240"), "seq_no" : 6694 } • { "_id" : ObjectId("4ff4db34e5557c28bf0022b9"), "seq_no" : 6815 } • { "_id" : ObjectId("4ff4db41e5557c28bf00239f"), "seq_no" : 7045 } • { "_id" : 
ObjectId("4ff4db53e5557c28bf0024fe"), "seq_no" : 7396 } • { "_id" : ObjectId("4ff4db66e5557c28bf00265d"), "seq_no" : 7747 } • { "_id" : ObjectId("4ff4db68e5557c28bf002678"), "seq_no" : 7774 } • { "_id" : ObjectId("4ff4db6be5557c28bf0026af"), "seq_no" : 7829 } • >
The collection should have 8185 documents But it has only 8122. Where did the rest go ?
63 of them still have done=0 8122 + 63 = 8185 ! Aha, mystery solved. They fell through the cracks Need a catch-all final run
Day in the life of a Control Number Detective - Run #1 • Remember : 973,323 users. So, 9734 cmd strings (100 users per string) • > db.api_str.count() • 9831 • > db.api_str.count({"done":0}) • 239
• > db.t_users_info.count() • 9592 • > db.api_str.count({"api_str":""}) • 97 • So we should have 9831 - 97 = 9734 records • The second run should generate 9734-9592 = 142 calls (i.e. 350-142=208 rate-limit should remain). Let us see. • { • …
• "x-‐ratelimit-‐class": "api_identified", • "x-‐ratelimit-‐limit": "350", • "x-‐ratelimit-‐remaining": "209", • … • } • Yep, 209 left • >
Day in the life of a Control Number Detective - Run #2 • Remember : 973,323 users. So, 9734 cmd strings (100 users per string) • > db.t_users_info.count() • 9728 • > db.api_str.count({"api_str":""}) • 97
• > db.api_str.count({"done":0}) • 103 • > 9734-9728=6, same as 103-97 ! • Run once more ! • > db.api_str.find({"done":0},{"seq_no":1}) • … • { "_id" : ObjectId("4ff4dbd4e5557c28bf002e22"), "seq_no" : 9736 } • { "_id" : ObjectId("4ff4db05e5557c28bf001f47"), "seq_no" : 5933 }
• { "_id" : ObjectId("4ff4db8be5557c28bf0028f6"), "seq_no" : 8412 } • { "_id" : ObjectId("4ff4dba2e5557c28bf002a8c"), "seq_no" : 8818 } • { "_id" : ObjectId("4ff4dbaee5557c28bf002b69"), "seq_no" : 9039 } • { "_id" : ObjectId("4ff4dbb8e5557c28bf002c1c"), "seq_no" : 9218 } • …
• { • … • "x-ratelimit-limit": "350", • "x-ratelimit-remaining": "344", • …
• } • Yep, 6 more records • > db.t_users_info.count() • 9734
• Good, got 9734 !
Professor Layton would be proud !
In fact, I have all four & plan to spend some time with them & Laphroaig !
Monitor runs & track control numbers
Unroll run 8:48 PM to ~4:08 PM next day !
Track error & the document numbers
Code & Run Walk Through o Code:
§ oscon2012_find_strong_ties_01.py § oscon2012_social_graph_stats_01.py
o Challenges: § None. Python set operations made this easy
o Interesting Points: § Even at this scale, single machine is not enough § Should have tried data parallelism
• This task is well suited to leverage data parallelism as it is commutative & associative
• Was getting invalid cursor error from MongoDB • So had to do the updates in two steps
Stage 5
o For each @clouderati follower
o Find friend=follower - set intersection
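The friend=follower computation is a plain set intersection, which is why Stage 5 had no real challenges. A minimal sketch with made-up screen names:

```python
# Hypothetical follower/friend lists for one user
followers = {"alice", "bob", "carol", "dave"}
friends   = {"bob", "carol", "erin"}

# Strong ties: users on both lists (they follow each other)
strong_ties = followers & friends             # {'bob', 'carol'}

# Fraction of followers who are mutual -- the "Strong Ties" % used later
strength = len(strong_ties) / len(followers)  # 0.5
```

Python's set operations are C-backed, which is what made this feasible even at 235,697 users; the commutative/associative nature of intersection is also what makes the step data-parallel.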
Code & Run Walk Through o Code:
§ oscon2012_find_cliques_01.py
o Challenges: o Lots of good information hidden in
the data ! o Memory !
o Interesting Points: o Graph, List & set operations o networkx has lots of interesting
graph algorithms o Collections.Counter to the rescue
Stage 6
o Create social graph o Apply network theory o Infer cliques & other
properties
Twitter Social Graph Analysis of @clouderati
o 2072 Followers; 973,323 unique users one level down w/ followers/friends trimmed at 5,000
o Strong ties o follower=friend
o 235,697 users, 462,419 edges o 501,367 Cliques o 8,906 cliques w/ > 10 users, spanning 253 unique users
o GeorgeReese in 7,973 of them ! See list for the 1st 125
o krishnan 3,446, randy 2,197, joe 1,977, sam 1,937, jp 485, stu 403, urquhart 263, beaker 226, acroll 149, adrian 63, gevaperry 24
o Of course, clique analysis does not tell us the whole story …
Clique Distribution = {2: 296521, 3: 58368, 4: 36421, 5: 28788, 6: 24197, 7: 20240, 8: 15997, 9: 11929, 10: 6576, 11: 1909, 12: 364, 13: 55, 14: 2}
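A distribution like the one above falls out of `collections.Counter` (the "Counter to the rescue" note in Stage 6). A toy sketch over hypothetical cliques:

```python
from collections import Counter

# Hypothetical clique lists -- the real run produced 501,367 of them
cliques = [["a", "b"], ["a", "b", "c"], ["c", "d"], ["a", "c", "d"]]

# Map clique size -> number of cliques of that size
distribution = Counter(len(c) for c in cliques)
# At full scale this yields {2: 296521, 3: 58368, ...}; here: {2: 2, 3: 2}
```

Counting user appearances across cliques (GeorgeReese in 7,973 of them) is the same one-liner with user names instead of sizes.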
Twitter Social Graph Analysis of @clouderati
o sort by followers vs. sort by strong ties is interesting
Celebrity – very low strong ties
Medium Celebrity, medium strong ties
Higher Celebrity, low strong ties
Twitter Social Graph Analysis of @clouderati o A higher “Strong Ties”
number is interesting § It means a very high
follower-‐friend intersection
§ Reeves 62%, bgolden 85%
o But a high clique count with a smaller “Strong Ties” number shows a more cohesive & stronger social graph § e.g. Krishnan - 15%
friends-followers § Samj - 33%
Twitter Social Graph Analysis of @clouderati
o Ideas for more Exploration § Include all
followers (instead of stopping at the 5000 cap)
§ Get tweets & track @mention
§ Frequent @mention shows stronger ties
§ #tag analysis could show some interesting networks
Twitter Tips – A Baker’s Dozen 1. Twitter APIs are (more or less) congruent & symmetric 2. Twitter is usually right & simple -‐ recheck when you get unexpected results
before blaming Twitter o I was getting numbers when I was expecting screen_names in user objects. o Was ready to send blasting e-‐mails to Twitter team. Decided to check one more time
and found that my parameter key was wrong - screen_name instead of user_id o Always test with one or two records before a long run ! - learned the hard way
3. Twitter APIs are very powerful - consistent use can yield huge amounts of data o In a week, you can pull in 4-5 million users & some tweets ! o Night runs are far faster & more error-free
4. Use a NOSQL data store as a command buffer & data buffer o Would make it easy to work with Twitter at scale o I use MongoDB o Keep the schema simple & no fancy transformation
• And as far as possible same as the (json) response o Use NOSQL CLI for trimming records et al
The Beginning As The End
Twitter Tips – A Baker’s Dozen 5. Always use a big data pipeline
o Collect - Store - Transform & Analyze - Model & Reason - Predict, Recommend & Visualize o That way you can orthogonally extend, with functional components like command buffers,
validation et al 6. Use functional approach for a scalable pipeline
o Compose your data big pipeline with well defined granular functions, each doing only one thing o Don’t overload the functional components (i.e. no collect, unroll & store as a single component) o Have well defined functional components with appropriate caching, buffering, checkpoints &
restart techniques • This did create some trouble for me, as we will see later
7. Crawl-‐Store-‐Validate-‐Recrawl-‐Refresh cycle o The equivalent of the traditional ETL o Validation stage & validation routines are important
• Cannot expect perfect runs • Cannot manually look at data either, when data is at scale
8. Have control numbers to validate runs & monitor them o I still remember control numbers which start with the number of punch cards in the input deck & then follow that
number through the various runs ! o There will be a separate printout of the control numbers that will be kept in the operations files
Twitter Tips – A Baker’s Dozen 9. Program defensively
o more so for REST-based big data analytics systems o Expect failures at the transport layer & accommodate for them
10. Have Erlang-‐style supervisors in your pipeline o Fail fast & move on o Don’t linger and try to fix errors that cannot be controlled at that layer o A higher layer process will circle back and do incremental runs to
correct missing spiders and crawls o Be aware of visibility & lack of context. Validate at the lowest layer that
has enough context to take corrective actions o I have an example in part 2
11. Data will never be perfect o Know your data & accommodate for its idiosyncrasies
• for example: 0 followers, protected users, 0 friends,…
Twitter Tips – A Baker’s Dozen 12. Check Point frequently (preferably after every API call) & have a
re-‐startable command buffer cache o See a MongoDB example in Part 2
13. Don’t bombard the URL o Wait a few seconds before successful calls. This will end up with a
scalable system, eventually o I found 10 seconds to be the sweet spot. 5 seconds gave retry error. Was able to
work with 5 seconds with wait & retry. Then, the rate limit started kicking in ! 14. Always measure the elapsed time of your API runs & processing
o Kind of early warning when something is wrong
15. Develop incrementally; don’t fail to check “cut & paste” errors
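Tips 9, 12 & 13 combine into a small wait-and-retry wrapper. A sketch with an injectable `sleep` so the pause is testable (the 10-second default matches the sweet spot found above; the function names are mine):

```python
import time

def call_with_retry(call, retries=3, wait=10, sleep=None):
    """Retry a flaky API call, pausing `wait` seconds between attempts.
    `sleep` is injectable for testing; defaults to time.sleep."""
    sleep = sleep or time.sleep
    for attempt in range(retries):
        try:
            return call()
        except IOError:
            if attempt == retries - 1:
                raise            # fail fast -- let the supervisor layer recrawl
            sleep(wait)

# Demo: a call that fails once, then succeeds; record pauses instead of sleeping
pauses = []
state = {"calls": 0}
def flaky():
    state["calls"] += 1
    if state["calls"] == 1:
        raise IOError("transport hiccup")
    return "ok"

result = call_with_retry(flaky, sleep=pauses.append)
```

On final failure the exception propagates, which is exactly the Erlang-style supervisor behaviour of tip 10: fail fast and let the validate/recrawl pass clean up.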
Twitter Tips – A Baker’s Dozen 16. The Twitter big data pipeline has lots of opportunities for parallelism
o Leverage data parallelism frameworks like MapReduce o But first :
§ Prototype as a linear system, § Optimize and tweak the functional modules & cache strategies, § Note down stages and tasks that can be parallelized and § Then parallelize them
o For the example project, we will see later, I did not leverage any parallel frameworks, but the opportunities were clearly evident. I will point them out, as we progress through the tutorial
17. Pay attention to handoffs between stages o They might require transformation – for example collect & store might store a user list
as multiple arrays, while the model requires each user to be a document for aggregation
o But resist the urge to overload collect with transform o i.e. let the collect stage store in arrays, but then have an unroll/flatten stage to transform
the array to separate documents o Add transformation as a granular function – of course, with appropriate buffering, caching,
checkpoints & restart techniques 18. Have a good log management system to capture and wade through
logs
Twitter Tips – A Baker’s Dozen 19. Understand the underlying network characteristics for the
inference you want to make o Twitter Network != Facebook Network , Twitter Graph != LinkedIn Graph
o Twitter Network is more of an Interest Network o So, many of the traditional network mechanisms & mechanics, like network
diameter & degrees of separation, might not make sense o But, others like Cliques and Bipartite Graphs do
Twitter Gripes 1. Need more rich APIs for #tags
o Somewhat similar to users viz. followers, friends et al o Might make sense to make #tags a top level object with its own semantics
2. HTTP Error Return is not uniform o Returns 400 bad Request instead of 420 o Granted, there is enough information to figure this out
3. Need an easier way to get screen_name from user_id 4. “following” vs. “friends_count” i.e. “following” is a dummy variable.
o There are a few like this, most probably for backward compatibility 5. Parameter Validation is not uniform
o Gives “404 Not found” instead of “406 Not Acceptable” or “416 Range Unacceptable”
6. Overall more validation would help o Granted, it is more of growing pains. Once one comes across a few inconsistencies, the
rest is easy to figure out
Thanks To these Giants …
I had a good time researching &
preparing for this Tutorial.
I hope you learned a few new things &
have a few vectors to follow