The Art of Social Media Analysis with Twitter & Python-OSCON 2012
The Art of Social Media Analysis with Twitter & Python
krishna sankar @ksankar
http://www.oscon.com/oscon2012/public/schedule/detail/23130
Intro
API, Objects,…
Twitter Network Analysis Pipeline
@mention network
Growth, weak ties
Retweet analytics, Information contagion
Cliques, social graph
NLP, NLTK, Sentiment Analysis
#tag Network
House Rules (1 of 2)
o Doesn’t assume any knowledge of the Twitter API
o Goal: everybody on the same page with a working knowledge of the Twitter API
o To bootstrap your exploration into Social Network Analysis & Twitter
o Simple programs, to illustrate usage & data manipulation
We will analyze @clouderati, 2072 followers, exploding to ~980,000 distinct users down one level
House Rules (2 of 2)
o Am using the requests library
o There are good Twitter frameworks for Python, but I wanted to build from the basics. Once one understands the fundamentals, frameworks can help
o Many areas to explore – not enough time. So I decided to focus on the social graph, cliques & networkx
About Me
• Lead Engineer/Data Scientist/AWS Ops Guy at Genophen.com
o Co-chair – 2012 IEEE Precision Time Synchronization
• http://www.ispcs.org/2012/index.html
o Blog: http://doubleclix.wordpress.com/
o Quora: http://www.quora.com/Krishna-Sankar
• Prior Gigs
o Lead Architect (Egnyte)
o Distinguished Engineer (CSCO)
o Employee #64439 (CSCO) to #39 (Egnyte) & now #9!
• Current Focus:
o Design, build & ops of BioInformatics/Consumer Infrastructure on AWS, MongoDB, Solr, Drupal, GitHub, …
o Big Data (more of variety, variability, context & graphs, than volume or velocity – so far!)
o Overlay-based semantic search & ranking
• Other related Presentations
o http://goo.gl/P1rhc Big Data Engineering Top 10 Pragmatics (Summary)
o http://goo.gl/0SQDV The Art of Big Data (Detailed)
o http://goo.gl/EaUKH The Hitchhiker’s Guide to Kaggle, OSCON 2011 Tutorial
Twitter Tips – A Baker’s Dozen
1. Twitter APIs are (more or less) congruent & symmetric
2. Twitter is usually right & simple – recheck when you get unexpected results before blaming Twitter
o I was getting numbers when I was expecting screen_names in user objects
o Was ready to send blasting e-mails to the Twitter team. Decided to check one more time and found that my parameter key was wrong – screen_name instead of user_id
o Always test with one or two records before a long run! Learned the hard way
3. Twitter APIs are very powerful – consistent use can bear huge data
o In a week, you can pull in 4-5 million users & some tweets!
o Night runs are far faster & more error-free
4. Use a NOSQL data store as a command buffer & data buffer
o Would make it easy to work with Twitter at scale
o I use MongoDB
o Keep the schema simple & no fancy transformation
• And, as far as possible, the same as the (JSON) response
o Use the NOSQL CLI for trimming records et al
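Tip 4 can be sketched in a few lines. This is a minimal illustration, not the tutorial's actual code: a plain dict and list stand in for the MongoDB collections, and the ids and screen_names are made up.

```python
# Command buffer + data buffer sketch (tip 4). With pymongo you would
# swap the dict/list for MongoDB collections; the schema stays the
# same as the raw JSON response -- no fancy transformation.
data_buffer = {}      # raw Twitter user objects, keyed by id_str
command_buffer = []   # user ids still waiting to be crawled

def store_response(user_obj):
    """Store the response as-is, keeping the JSON schema."""
    data_buffer[user_obj["id_str"]] = user_obj

def enqueue(user_ids):
    command_buffer.extend(str(u) for u in user_ids)

def next_batch(n=100):
    """Pop up to n pending ids; whatever remains is the restart point."""
    batch, rest = command_buffer[:n], command_buffer[n:]
    command_buffer[:] = rest
    return batch

enqueue([15254297, 44614426])          # ids as a followers/ids call returns
for uid in next_batch():
    store_response({"id_str": uid, "screen_name": "user_" + uid})
print(len(data_buffer))  # 2
```

The same two structures reappear in the Part 2 walkthrough, backed by MongoDB.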
The End As The Beginning
Twitter Tips – A Baker’s Dozen
5. Always use a big data pipeline
o Collect - Store - Transform & Analyze - Model & Reason - Predict, Recommend & Visualize
o That way you can orthogonally extend, with functional components like command buffers, validation et al
6. Use a functional approach for a scalable pipeline
o Compose your big data pipeline with well-defined granular functions, each doing only one thing
o Don’t overload the functional components (i.e. no collect, unroll & store as a single component)
o Have well-defined functional components with appropriate caching, buffering, checkpoints & restart techniques
• This did create some trouble for me, as we will see later
7. Crawl-Store-Validate-Recrawl-Refresh cycle
o The equivalent of the traditional ETL
o Validation stage & validation routines are important
• Cannot expect perfect runs
• Cannot manually look at data either, when data is at scale
8. Have control numbers to validate runs & monitor them
o I still remember control numbers which start with the number of punch cards in the input deck & then follow that number through the various runs!
o There will be a separate printout of the control numbers that will be kept in the operations files
Twitter Tips – A Baker’s Dozen
9. Program defensively
o More so for REST-based Big Data analytics systems
o Expect failures at the transport layer & accommodate for them
10. Have Erlang-style supervisors in your pipeline
o Fail fast & move on
o Don’t linger and try to fix errors that cannot be controlled at that layer
o A higher-layer process will circle back and do incremental runs to correct missing spiders and crawls
o Be aware of visibility & lack of context. Validate at the lowest layer that has enough context to take corrective actions
o I have an example in part 2
11. Data will never be perfect
o Know your data & accommodate for its idiosyncrasies
• For example: 0 followers, protected users, 0 friends, …
Twitter Tips – A Baker’s Dozen
12. Checkpoint frequently (preferably after every API call) & have a re-startable command buffer cache
o See a MongoDB example in Part 2
13. Don’t bombard the URL
o Wait a few seconds between successive calls. This will end up as a scalable system, eventually
o I found 10 seconds to be the sweet spot. 5 seconds gave retry errors. Was able to work with 5 seconds with wait & retry. Then the rate limit started kicking in!
14. Always measure the elapsed time of your API runs & processing
o A kind of early warning when something is wrong
15. Develop incrementally; don’t fail to check “cut & paste” errors
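Tip 12 (checkpoint after every call, restartable runs) can be sketched like this. A temp file stands in for the MongoDB command buffer shown in Part 2, and the loop body is a placeholder for the real API call.

```python
# Checkpoint-after-every-call sketch (tip 12). A temp file records the
# crawl position so an interrupted run restarts where it left off.
import os
import tempfile

fd, CHECKPOINT = tempfile.mkstemp()
os.close(fd)

def load_checkpoint():
    with open(CHECKPOINT) as f:
        text = f.read().strip()
    return int(text) if text else 0

def save_checkpoint(i):
    with open(CHECKPOINT, "w") as f:
        f.write(str(i))

user_ids = list(range(10))              # stand-in for real user ids
for i in range(load_checkpoint(), len(user_ids)):
    # ... one Twitter API call for user_ids[i] would go here ...
    save_checkpoint(i + 1)              # checkpoint after every call

print(load_checkpoint())  # 10
```

Restarting the script re-enters the loop at the saved position instead of index 0.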
Twitter Tips – A Baker’s Dozen
16. The Twitter big data pipeline has lots of opportunities for parallelism
o Leverage data parallelism frameworks like MapReduce
o But first:
§ Prototype as a linear system,
§ Optimize and tweak the functional modules & cache strategies,
§ Note down stages and tasks that can be parallelized, and
§ Then parallelize them
o For the example project we will see later, I did not leverage any parallel frameworks, but the opportunities were clearly evident. I will point them out as we progress through the tutorial
17. Pay attention to handoffs between stages
o They might require transformation – for example, collect & store might store a user list as multiple arrays, while the model requires each user to be a document for aggregation
o But resist the urge to overload collect with transform
o i.e. let the collect stage store in arrays, but then have an unroll/flatten stage to transform the array into separate documents
o Add transformation as a granular function – of course, with appropriate buffering, caching, checkpoints & restart techniques
18. Have a good log management system to capture and wade through logs
Twitter Tips – A Baker’s Dozen
19. Understand the underlying network characteristics for the inference you want to make
o Twitter Network != Facebook Network, Twitter Graph != LinkedIn Graph
o The Twitter Network is more of an Interest Network
o So many of the traditional network mechanisms & mechanics, like network diameter & degrees of separation, might not make sense
o But others, like Cliques and Bipartite Graphs, do
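Cliques are covered with networkx in Part 2; the underlying idea fits in a few lines of plain Python. The follow graph below is invented for illustration: a clique is a set of users who all mutually follow each other.

```python
# A clique in the mutual-follow graph: every pair follows each other.
# Toy graph; with networkx you would build a Graph of mutual edges
# and call nx.find_cliques(G).
follows = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},          # d follows a, but a does not follow back
}

def mutual(u, v):
    return v in follows.get(u, set()) and u in follows.get(v, set())

def is_clique(users):
    users = list(users)
    return all(mutual(u, v)
               for i, u in enumerate(users)
               for v in users[i + 1:])

print(is_clique({"a", "b", "c"}))  # True
print(is_clique({"a", "b", "d"}))  # False
```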
Twitter Gripes
1. Need richer APIs for #tags
o Somewhat similar to users viz. followers, friends et al
o Might make sense to make #tags a top-level object with its own semantics
2. HTTP Error Return is not uniform
o Returns 400 Bad Request instead of 420
o Granted, there is enough information to figure this out
3. Need an easier way to get screen_name from user_id
4. “following” vs. “friends_count”, i.e. “following” is a dummy variable
o There are a few like this, most probably for backward compatibility
5. Parameter Validation is not uniform
o Gives “404 Not Found” instead of “406 Not Acceptable”, “413 Too Long” or “416 Range Unacceptable”
6. Overall, more validation would help
o Granted, it is more of growing pains. Once one comes across a few inconsistencies, the rest is easy to figure out
A Fork
• Not enough time for both
• NLP, NLTK & deep dive into Tweets
o Sentiment Analysis
• I chose the Social Graph route
A minute about Twitter as a platform & its evolution
My Wish & Hope
• I spend a lot of time with Twitter & derive value; the platform is rich & the APIs intuitive
• I did like the fact that tweets are part of LinkedIn. I still used Twitter more than LinkedIn
o I don’t think showing Tweets in LinkedIn took anything away from the Twitter experience
o The LinkedIn experience & the Twitter experience are different & distinct. Showing tweets in LinkedIn didn’t change that
• I sincerely hope that the platform grows with a rich developer ecosystem
• An orthogonally extensible platform is essential
• Of course, along with a congruent user experience – “… core Twitter consumption experience through consistent tools”
https://dev.twitter.com/blog/delivering-consistent-twitter-experience
“The micro-blogging service must find the right balance of running a profitable business and maintaining a robust developers' community.” – Chenda, CBS News
“… we want to make sure that the Twitter experience is straightforward and easy to understand -- whether you’re on Twitter.com or elsewhere on the web” – Michael
Setup
• For Hands-on Today
o Python 2.7.3
o easy_install -v requests
• http://docs.python-requests.org/en/latest/user/quickstart/#make-a-request
o easy_install -v requests-oauth
o Hands-on programs at https://github.com/xsankar/oscon2012-handson
• For advanced data science with social graphs
o easy_install -v networkx
o easy_install -v numpy
o easy_install -v nltk
• Not for this tutorial, but good for sentiment analysis et al
o MongoDB
• I used MongoDB in AWS m2.xlarge, RAID 10 X 8 X 15 GB EBS
o graphviz - http://www.graphviz.org/; easy_install pygraphviz
o easy_install pydot
Thanks To these Giants …
Problem Domain for this tutorial
• Data Science (trends, analytics et al) on Social Networks as observed by Twitter primitives
o Not for Twitter-based apps for real-time tweets
o Not web sites with real-time tweets
• By looking at the domain in aggregate to derive inferences & actionable recommendations
• Which also means you need to be deliberate & systemic (i.e. not look at a fluctuation as a trend, but dig deeper before pronouncing a trend)
Agenda
I. Mechanics: Twitter API (1:30 PM - 3:00 PM)
o Essential Fundamentals (Rate Limit, HTTP Codes et al)
o Objects
o API
o Hands-on (2:45 PM - 3:00 PM)
II. Break (3:00 PM - 3:30 PM)
III. Twitter Social Graph Analysis (3:30 PM - 5:00 PM)
o Underlying Concepts
o Social Graph Analysis of @clouderati
§ Stages, Strategies & Tasks
§ Code Walk-thru
Open This First
Twitter API: Read These First
• Using the Twitter Brand
o New logo & associated guidelines: https://twitter.com/about/logos
o Twitter Rules: https://support.twitter.com/groups/33-report-a-violation/topics/121-guidelines-best-practices/articles/18311-the-twitter-rules
o Developer Rules of the Road: https://dev.twitter.com/terms/api-terms
• Read These Links First
1. https://dev.twitter.com/docs/things-every-developer-should-know
2. https://dev.twitter.com/docs/faq
3. Field Guide to Objects: https://dev.twitter.com/docs/platform-objects
4. Security: https://dev.twitter.com/docs/security-best-practices
5. Media Best Practices: https://dev.twitter.com/media
6. Consolidated Page: https://dev.twitter.com/docs
7. Streaming APIs: https://dev.twitter.com/docs/streaming-apis
8. How to Appeal (not that you all would need it!): https://support.twitter.com/articles/72585
• Only one version of the Twitter APIs
API Status Page
• https://dev.twitter.com/status
• https://dev.twitter.com/issues
• https://dev.twitter.com/discussions
http://www.buzzfeed.com/tommywilhelm/google-users-being-total-dicks-about-the-twitter
Open This First
• Install pre-reqs as per the setup slide
• Run
o oscon2012_open_this_first.py
o To test connectivity – a “canary query”
• Run
o oscon2012_rate_limit_status.py
o Use http://www.epochconverter.com to check reset_time
• Formats: xml, json, atom & rss
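What oscon2012_rate_limit_status.py inspects can be sketched like this: read the x-ratelimit-* headers and convert the epoch reset time in code (time.gmtime does what epochconverter.com does by hand). The header values are copied from the rate-limit slides that follow.

```python
# Convert the x-ratelimit-reset epoch seconds into a readable UTC time.
import time

headers = {
    "x-ratelimit-limit": "150",
    "x-ratelimit-remaining": "147",
    "x-ratelimit-reset": "1341366831",
}

remaining = int(headers["x-ratelimit-remaining"])
reset_utc = time.strftime("%Y-%m-%d %H:%M:%S",
                          time.gmtime(int(headers["x-ratelimit-reset"])))
print(remaining, "calls left; window resets at", reset_utc, "UTC")
```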
Twitter API
[Diagram: the Twitter API family]
• Twitter REST – Core Data, Core Twitter Objects: build profile, create/post tweets, reply, favorite, re-tweet. Rate Limit: 150/350
• Twitter Search – Search & Trends: keywords, specific users, trends. Rate Limit: complexity & frequency
• Streaming – Near-realtime, high volume: Public Streams, User Streams, Site Streams; follow users, topics, data mining
• Firehose
Rate Limits
• By API type & Authentication Mode

API       | No authC                | authC  | Error
REST      | 150/hr                  | 350/hr | 400
Search    | Complexity & Frequency  | -N/A-  | 420
Streaming | Up to 1%                |        |
Firehose  | none                    | none   |
Rate Limit Header
{
  "status": "200 OK",
  "vary": "Accept-Encoding",
  "x-frame-options": "SAMEORIGIN",
  "x-mid": "8e775a9323c45f2a541eeb4d2d1eb9b468w81c6",
  "x-ratelimit-class": "api",
  "x-ratelimit-limit": "150",
  "x-ratelimit-remaining": "149",
  "x-ratelimit-reset": "1340467358",
  "x-runtime": "0.04144",
  "x-transaction": "2b49ac31cf8709af",
  "x-transaction-mask": "a6183ffa5f8ca943ff1b53b5644ef114df9d6bba"
}
Rate Limit-ed Header
{
  "cache-control": "no-cache, max-age=300",
  "content-encoding": "gzip",
  "content-length": "150",
  "content-type": "application/json; charset=utf-8",
  "date": "Wed, 04 Jul 2012 00:48:25 GMT",
  "expires": "Wed, 04 Jul 2012 00:53:25 GMT",
  "server": "tfe",
  …
  "status": "400 Bad Request",
  "vary": "Accept-Encoding",
  "x-ratelimit-class": "api",
  "x-ratelimit-limit": "150",
  "x-ratelimit-remaining": "0",
  "x-ratelimit-reset": "1341363230",
  "x-runtime": "0.01126"
}
Rate Limit Example
• Run
o oscon2012_rate_limit_02.py
• It iterates through a list to get followers
• The list is 2072 long
{
  …
  "date": "Wed, 04 Jul 2012 00:54:16 GMT",
  "status": "200 OK",
  "vary": "Accept-Encoding",
  "x-frame-options": "SAMEORIGIN",
  "x-mid": "f31c7278ef8b6e28571166d359132f152289c3b8",
  "x-ratelimit-class": "api",
  "x-ratelimit-limit": "150",
  "x-ratelimit-remaining": "147",
  "x-ratelimit-reset": "1341366831",
  "x-runtime": "0.02768",
  "x-transaction": "f1bafd60112dddeb",
  "x-transaction-mask": "a6183ffa5f8ca943ff1b53b5644ef11417281dbc"
}
Last time, it gave me 5 min. Now the reset timer is 1 hour: 150 calls, not authenticated
{
  "cache-control": "no-cache, max-age=300",
  "content-encoding": "gzip",
  "content-type": "application/json; charset=utf-8",
  "date": "Wed, 04 Jul 2012 00:55:04 GMT",
  …
  "status": "400 Bad Request",
  "transfer-encoding": "chunked",
  "vary": "Accept-Encoding",
  "x-ratelimit-class": "api",
  "x-ratelimit-limit": "150",
  "x-ratelimit-remaining": "0",
  "x-ratelimit-reset": "1341366831",
  "x-runtime": "0.01342"
}
And the Rate Limit kicked in
API with OAuth
{
  …
  "date": "Wed, 04 Jul 2012 01:32:01 GMT",
  "etag": "\"dd419c02ed00fc6b2a825cc27wbe040\"",
  "expires": "Tue, 31 Mar 1981 05:00:00 GMT",
  "last-modified": "Wed, 04 Jul 2012 01:32:01 GMT",
  "pragma": "no-cache",
  "server": "tfe",
  …
  "status": "200 OK",
  "vary": "Accept-Encoding",
  "x-access-level": "read",
  "x-frame-options": "SAMEORIGIN",
  "x-mid": "5bbb87c04fa43c43bc9d7482bc62633a1ece381c",
  "x-ratelimit-class": "api_identified",
  "x-ratelimit-limit": "350",
  "x-ratelimit-remaining": "349",
  "x-ratelimit-reset": "1341369121",
  "x-runtime": "0.05539",
  "x-transaction": "9f8508fe4c73a407",
  "x-transaction-mask": "a6183ffa5f8ca943ff1b53b5644ef11417281dbc"
}
OAuth: “api_identified”, 1 hr reset, 350 calls
{
  …
  "date": "Thu, 05 Jul 2012 14:56:05 GMT",
  …
  "x-ratelimit-class": "api_identified",
  "x-ratelimit-limit": "350",
  "x-ratelimit-remaining": "133",
  "x-ratelimit-reset": "1341500165",
  …
}
******** 2416
{
  …
  "date": "Thu, 05 Jul 2012 14:56:18 GMT",
  …
  "status": "200 OK",
  …
  "x-ratelimit-class": "api_identified",
  "x-ratelimit-limit": "350",
  "x-ratelimit-remaining": "349",
  "x-ratelimit-reset": "1341503776",
  …
}
******** 2417
The Rate Limit resets during consecutive calls (+1 hour)
Unexplained Errors
Traceback (most recent call last):
  File "oscon2012_get_user_info_01.py", line 39, in <module>
    r = client.get(url, params=payload)
  File "build/bdist.macosx-10.6-intel/egg/requests/sessions.py", line 244, in get
  File "build/bdist.macosx-10.6-intel/egg/requests/sessions.py", line 230, in request
  File "build/bdist.macosx-10.6-intel/egg/requests/models.py", line 609, in send
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='api.twitter.com', port=443): Max
retries exceeded with url: /1/users/lookup.json?user_id=237552390%2C101237516%2C208192270%2C340183853%2C221203257%2C15254297%2C44614426%2C617136931%2C415810340%2C76071717%2C17351462%2C574253%2C35048243%2C388547381%2C254329657%2C65585979%2C253580293%2C392741693%2C126403390%2C300467007%2C8962882%2C21545799%2C15254346%2C141083469%2C340312913%2C44614485%2C600359770%2C17351519%2C38323042%2C21545828%2C86557546%2C90751854%2C128500592%2C115917681%2C42517364%2C34128760%2C15254397%2C453559166%2C92849025%2C600359811%2C17351556%2C8962952%2C296038349%2C325503810%2C122209166%2C123827693%2C59294611%2C19448725%2C21545881%2C17351581%2C130468677%2C80266144%2C15254434%2C84680859%2C65586084%2C19448741%2C15254438%2C214483879%2C48808878%2C88654768%2C15474846%2C48808887%2C334021563%2C60214090%2C134792126%2C15254464%2C558416833%2C138986435%2C264815556%2C63488965%2C17222476%2C537445328%2C97854214%2C255598755%2C65586132%2C36226009%2C187220954%2C257346383%2C15254493%2C554222558%2C302564320%2C59165520%2C44614626%2C76071907%2C80266213%2C325503825%2C403227628%2C20368210%2C17351666%2C88654836%2C340313077%2C151569400%2C302564345%2C118014971%2C11060222%2C233229141%2C13727232%2C199803906%2C220435108%2C268531201
While trying to get details of 1,000,000 users, I get this error – usually 10-6 AM PST. Got around it by “trap & wait 5 seconds”. Night runs are relatively error-free
{
  …
  "date": "Fri, 06 Jul 2012 03:41:09 GMT",
  "expires": "Fri, 06 Jul 2012 03:46:09 GMT",
  "server": "tfe",
  "set-cookie": "dnt=; domain=.twitter.com; path=/; expires=Thu, 01-Jan-1970 00:00:00 GMT",
  "status": "400 Bad Request",
  "vary": "Accept-Encoding",
  "x-ratelimit-class": "api_identified",
  "x-ratelimit-limit": "350",
  "x-ratelimit-remaining": "0",
  "x-ratelimit-reset": "1341546334",
  "x-runtime": "0.01918"
}
Missed by 4 min!
Error, sleeping
{
  …
  "date": "Fri, 06 Jul 2012 03:46:12 GMT",
  …
  "status": "200 OK",
  …
  "x-ratelimit-class": "api_identified",
  "x-ratelimit-limit": "350",
  "x-ratelimit-remaining": "349",
  …
}
OK after 5 min sleep
A Day in the Life of the Twitter Rate Limit
Strategies
I have no exotic strategies, so far!
1. Obvious: track elapsed time & sleep when the rate limit kicks in
2. Combine authenticated & non-authenticated calls
3. Use multiple API types
4. Cache
5. Store & get only what is needed
6. Checkpoint & buffer request commands
7. Distributed data parallelism – for example, AWS instances
http://www.epochconverter.com/ <- useful to debug the timer
Please share your tips and tricks for conserving the Rate Limit
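Strategy 1 above can be sketched as a small throttle: when x-ratelimit-remaining hits zero, sleep until x-ratelimit-reset. The sleep is injectable (stubbed with a lambda here) so the sketch runs instantly; the header values are hypothetical.

```python
# Sleep-until-reset throttle (strategy 1).
import time

def seconds_until_reset(headers, now=None):
    now = time.time() if now is None else now
    return max(0, int(headers["x-ratelimit-reset"]) - int(now))

def throttle(headers, sleep=time.sleep, now=None):
    """Returns the number of seconds slept (0 if calls remain)."""
    if int(headers["x-ratelimit-remaining"]) == 0:
        wait = seconds_until_reset(headers, now) + 1   # small cushion
        sleep(wait)
        return wait
    return 0

# An exhausted window whose reset is 300 s away from "now"
hdrs = {"x-ratelimit-remaining": "0", "x-ratelimit-reset": "1341366831"}
print(throttle(hdrs, sleep=lambda s: None, now=1341366531))  # 301
```

In a real run you would call throttle(r.headers) after each requests call.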
Authentication
Authentication
• Three modes
o Anonymous
o HTTP Basic Auth
o OAuth
• As of Aug 31, 2010, only Anonymous or OAuth are supported
• OAuth enables the user to authorize an application without sharing credentials
• It also has the ability to revoke access
• Twitter supports OAuth 1.0a
• OAuth 2.0 is the new standard, much simpler
o No timeframe for Twitter support, yet
OAuth Pragmatics
• Helpful Links
o https://dev.twitter.com/docs/auth/oauth
o https://dev.twitter.com/docs/auth/moving-from-basic-auth-to-oauth
o https://dev.twitter.com/docs/auth/oauth/single-user-with-examples
o http://blog.andydenmark.com/2009/03/how-to-build-oauth-consumer.html
• Discussion of OAuth internal mechanisms is better left for another day
• For headless applications, to get an OAuth token go to https://dev.twitter.com/apps
• Create an application & get four credential pieces
o Consumer Key, Consumer Secret, Access Token & Access Token Secret
• All the frameworks have support for OAuth. So plug in these values & use the framework’s calls
• I used the requests-oauth library like so:
requests-oauth

def get_oauth_client():
    # Credentials from dev.twitter.com/apps (the values below are placeholders)
    consumer_key = "5dbf348aa966c5f7f07e8ce2ba5e7a3badc234bc"
    consumer_secret = "fceb3aedb960374e74f559caeabab3562efe97b4"
    access_token = "df919acd38722bc0bd553651c80674fab2b465086782Ls"
    access_token_secret = "1370adbe858f9d726a43211afea2b2d9928ed878"
    header_auth = True
    oauth_hook = OAuthHook(access_token, access_token_secret, consumer_key,
                           consumer_secret, header_auth)
    client = requests.session(hooks={'pre_request': oauth_hook})
    return client

def get_followers(user_id):
    url = 'https://api.twitter.com/1/followers/ids.json'
    payload = {"user_id": user_id}  # if a cursor is needed: {"cursor": -1, "user_id": scr_name}
    r = requests.get(url, params=payload)

def get_followers_with_oauth(user_id, client):
    url = 'https://api.twitter.com/1/followers/ids.json'
    payload = {"user_id": user_id}  # if a cursor is needed: {"cursor": -1, "user_id": scr_name}
    r = client.get(url, params=payload)

Use the client instead of requests
Get the client using the token, key & secret from dev.twitter.com/apps
Ref: http://pypi.python.org/pypi/requests-oauth
OAuth Authorize Screen
• The user authenticates with Twitter & grants access to Forbes Social
• Forbes Social doesn’t have the user’s credentials, but uses OAuth to access the user’s account
HTTP Status Codes
• 0 Never made it to Twitter servers - library error
• 200 OK
• 304 Not Modified
• 400 Bad Request
o Check the error message for an explanation
o REST Rate Limit!
• 401 Unauthorized
o Beware – you could get this for other reasons as well
• 403 Forbidden
o Hit update limit (> max Tweets/day, following too many people)
• 404 Not Found
• 406 Not Acceptable
• 413 Too Long
• 416 Range Unacceptable
• 420 Enhance Your Calm
o Rate Limited
• 500 Internal Server Error
• 502 Bad Gateway
o Down for maintenance
• 503 Service Unavailable
o Overloaded – the “Fail Whale”
• 504 Gateway Timeout
o Overloaded
https://dev.twitter.com/docs/error-codes-responses
HTTP Status Code - Example
{
  "cache-control": "no-cache, max-age=300",
  "content-encoding": "gzip",
  "content-length": "91",
  "content-type": "application/json; charset=utf-8",
  "date": "Sat, 23 Jun 2012 00:06:56 GMT",
  "expires": "Sat, 23 Jun 2012 00:11:56 GMT",
  "server": "tfe",
  …
  "status": "401 Unauthorized",
  "vary": "Accept-Encoding",
  "www-authenticate": "OAuth realm=\"https://api.twitter.com\"",
  "x-ratelimit-class": "api",
  "x-ratelimit-limit": "0",
  "x-ratelimit-remaining": "0",
  "x-ratelimit-reset": "1340413616",
  "x-runtime": "0.01997"
}
{
  "errors": [
    {
      "code": 53,
      "message": "Basic authentication is not supported"
    }
  ]
}
Detailed error message in JSON! I like this
HTTP Status Code – Confusing Example
{
  …
  "pragma": "no-cache",
  "server": "tfe",
  …
  "status": "404 Not Found",
  …
}
{
  "errors": [
    {
      "code": 34,
      "message": "Sorry, that page does not exist"
    }
  ]
}
• GET https://api.twitter.com/1/users/lookup.json?screen_nme=twitterapi,twitter&include_entities=true
• Spelling mistake
o Should be screen_name
• But a confusing error!
• Should be 406 Not Acceptable or 413 Too Long, showing a parameter error
HTTP Status Code - Example
{
  "cache-control": "no-cache, no-store, must-revalidate, pre-check=0, post-check=0",
  "content-encoding": "gzip",
  "content-length": "112",
  "content-type": "application/json;charset=utf-8",
  "date": "Sat, 23 Jun 2012 01:23:47 GMT",
  "expires": "Tue, 31 Mar 1981 05:00:00 GMT",
  …
  "status": "401 Unauthorized",
  "www-authenticate": "OAuth realm=\"https://api.twitter.com\"",
  "x-frame-options": "SAMEORIGIN",
  "x-ratelimit-class": "api",
  "x-ratelimit-limit": "150",
  "x-ratelimit-remaining": "147",
  "x-ratelimit-reset": "1340417742",
  "x-transaction": "d545a806f9c72b98"
}
{
  "error": "Not authorized",
  "request": "/1/statuses/user_timeline.json?user_id=12%2C15%2C20"
}
Sometimes the errors are not correct. I got this error for user_timeline.json w/ user_id=20,15,12 – clearly a parameter error (i.e. too many parameters)
Objects
Twitter Platform Objects
[Diagram: Users – with Friends (follow) & Followers (are followed by); Tweets – status updates, temporally ordered into a Timeline; Entities – hashtags (#), media, urls, user_mentions (@), embeds; Places]
https://dev.twitter.com/docs/platform-objects
Tweets
• A.k.a. Status Updates
• Interesting fields
o coordinates <- geo location
o created_at
o entities (will see later)
o id, id_str
o possibly_sensitive
o user (will see later)
• Perspectival attributes, embedded within a child object of an unlike parent – hard to maintain at scale
• https://dev.twitter.com/docs/faq#6981
o withheld_in_countries
• https://dev.twitter.com/blog/new-withheld-content-fields-api-responses
https://dev.twitter.com/docs/platform-objects/tweets
A word about id, id_str
• June 1, 2010
o Snowflake, the id generator service
o “The full ID is composed of a timestamp, a worker number, and a sequence number”
o JavaScript had problems handling numbers > 53 bits
o "id": 819797
o "id_str": "819797"
http://engineering.twitter.com/2010/06/announcing-snowflake.html
https://groups.google.com/forum/?fromgroups#!topic/twitter-development-talk/ahbvo3VTIYI
https://dev.twitter.com/docs/twitter-ids-json-and-snowflake
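The id/id_str pair is easy to demonstrate. Python's json module keeps full integer precision, so both fields agree; the point of id_str is for JavaScript consumers, where numbers above 53 bits get mangled. The sample id below is made up.

```python
# Why id_str exists: Snowflake ids exceed JavaScript's 53-bit safe
# integer range. Python parses the numeric id exactly, but portable
# code should read id_str.
import json

raw = '{"id": 243145735212777472, "id_str": "243145735212777472"}'
tweet = json.loads(raw)

print(tweet["id"] == int(tweet["id_str"]))  # True -- no precision loss here
print(tweet["id"] > 2 ** 53)                # past the JavaScript-safe range
```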
Tweets -‐‑ example • Let us run oscon2012-‐tweets.py • Example of tweet
o coordinates o id o id_str
Users
• followers_count
• geo_enabled
• id, id_str
• name, screen_name
• protected
• status, statuses_count
• withheld_in_countries
https://dev.twitter.com/docs/platform-objects/users
Users – Let us run some examples
• Run
o oscon_2012_users.py – lookup users by screen_name
o oscon12_first_20_ids.py – lookup users by user_id
• Inspect the results
o id, name, status, statuses_count, protected, followers (for top 10 followers), withheld users
• Can use the information for customizing the user’s screen in your web app
Entities
• Metadata & Contextual Information
• You can parse them yourself, but Entities parse them out as structured data
• REST API/Search API – include_entities=1
• Streaming API – included by default
• hashtags, media, urls, user_mentions
https://dev.twitter.com/docs/platform-objects/entities
https://dev.twitter.com/docs/tweet-entities
https://dev.twitter.com/docs/tco-url-wrapper
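Reading the pre-parsed entities looks like this. The tweet below is a hand-made sample in the shape the API returns (with include_entities=1), not a live response.

```python
# Pull hashtags, expanded urls & mentions out of the entities block.
tweet = {
    "text": "Slides for #oscon at http://t.co/example cc @ksankar",
    "entities": {
        "hashtags": [{"text": "oscon", "indices": [11, 17]}],
        "urls": [{"url": "http://t.co/example",
                  "expanded_url": "http://www.oscon.com/"}],
        "user_mentions": [{"screen_name": "ksankar"}],
    },
}

tags = [h["text"] for h in tweet["entities"]["hashtags"]]
links = [u["expanded_url"] for u in tweet["entities"]["urls"]]
mentions = [m["screen_name"] for m in tweet["entities"]["user_mentions"]]
print(tags, links, mentions)
```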
Entities • Run
o oscon2012_entities.py
• Inspect hashtags, urls et al
Places
• attributes
• bounding_box
• id (as a string!)
• country
• name
https://dev.twitter.com/docs/platform-objects/places
https://dev.twitter.com/docs/about-geo-place-attributes
Places
• Can search for tweets near a place, like so:
• Get the latlong of the convention center [45.52929, -122.66289]
o Tweets near that place
• Tweets near San Jose [37.395715, -122.102308]
• We will not go further here, but it is very useful
Timelines
• Collections of tweets ordered by time
• Use max_id & since_id for navigation
https://dev.twitter.com/docs/working-with-timelines
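The max_id navigation can be sketched without hitting the network: fake_timeline below stands in for a GET statuses/user_timeline call, and the loop walks backwards one page at a time until nothing older comes back.

```python
# Walk a timeline backwards with max_id (per the working-with-timelines
# doc): each page asks for tweets at or below max_id, then max_id is
# set just below the oldest id received.
all_ids = list(range(250, 0, -1))      # pretend tweet ids, newest first

def fake_timeline(max_id=None, count=100):
    ids = [i for i in all_ids if max_id is None or i <= max_id]
    return ids[:count]

collected, max_id = [], None
while True:
    page = fake_timeline(max_id=max_id)
    if not page:
        break
    collected.extend(page)
    max_id = page[-1] - 1              # strictly older than this page
print(len(collected))  # 250
```

The same loop shape works for since_id in the other direction, collecting only tweets newer than the last one seen.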
Other Objects & APIs • Lists • Notifications • Friendships/exists to see if one follows the other
Hands-on Exercise (15 min)
• Setup environment – slide #14
• Sanity-check environment & libraries
o oscon2012_open_this_first.py
o oscon2012_rate_limit_status.py
• Get objects (show calls)
o Lookup users by screen_name - oscon12_users.py
o Lookup users by id - oscon12_first_20_ids.py
o Lookup tweets - oscon12_tweets.py
o Get entities - oscon12_entities.py
• Inspect the results
• Explore a little bit
• Discussion
Twitter APIs
Twitter REST API
• https://dev.twitter.com/docs/api
• What we have been doing is the REST API
• Request-Response
• Anonymous or OAuth
• Rate Limited:
o 150/350
Twitter Trends
• oscon2012-trends.py
• Trends/weekly, Trends/monthly
• Let us run some examples
o oscon2012_trends_daily.py
o oscon2012_trends_weekly.py
• Trends & hashtags
o #hashtag euro2012
o http://hashtags.org/euro2012
o http://sproutsocial.com/insights/2011/08/twitter-hashtags/
o http://blog.twitter.com/2012/06/euro-2012-follow-all-action-on-pitch.html
o Top 10: http://twittercounter.com/pages/100, http://twitaholic.com/
Brand Rank w/ Twitter
• Walk-through & results of the following
o oscon2012_brand_01.py
• Followed 10 user-brands for a few days to find growth
• Brand Rank
o Growth of a brand w.r.t. the industry
o A surge in popularity could be due to -ve or +ve buzz. Need to understand & correlate using Twitter APIs & metrics
• API: url='https://api.twitter.com/1/users/lookup.json'
• payload={"screen_name":"miamiheat,okcthunder,nba,uefacom,lovelaliga,FOXSoccer,oscon,clouderati,googleio,OReillyMedia"}
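The comparison uses percentages rather than absolute follower counts. A sketch of that normalization, with invented counts rather than real data from the runs:

```python
# Percentage growth of followers_count over a run of daily samples.
# The numbers are illustrative, not real data from the tutorial.
counts = {
    "oscon":      [11000, 11150, 11320],
    "clouderati": [2072, 2074, 2075],
}

def pct_growth(series):
    """Growth of the last sample over the first, as a percentage."""
    return round(100.0 * (series[-1] - series[0]) / series[0], 2)

for brand in sorted(counts):
    print(brand, pct_growth(counts[brand]))
```

Brands with very different absolute audiences become comparable this way, which is why the slides talk about whether brands “track” each other.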
Brand Rank w/ Twitter – Tech Brands
• Clouderati is very stable
• Google I/O showed a spike on 6/27-6/28
• OReillyMedia shares some spike
• Looking at a few days’ worth of data, our best inference is that “oscon doesn’t track with googleio”
• “Clouderati doesn’t track at all”
Brand Rank w/ Twitter – World of Soccer
• FOXSoccer & UEFAcom track each other
• The numbers seldom decrease, so calculating -ve velocity will not work. OTOH, if you see a -ve velocity, investigate
Brand Rank w/ Twitter – World of Basketball
• NBA, MiamiHeat & okcthunder track each other
• Used % rather than absolute numbers to compare
• The hike from 7/6 to 7/10 is interesting
Brand Rank w/ Twitter – Rising Tide …
• For some reason, all numbers are going up 7/6 thru 7/10 – except for clouderati!
• Is a rising (Twitter) tide lifting all (well, almost all)?
Trivia: Search API
• Search (search.twitter.com)
o Built by Summize, which was acquired by Twitter in 2008
o Summize described itself as “sentiment mining”
Search API
• Very simple
o GET http://search.twitter.com/search.json?q=<blah>
• Based on a search criterion
• “The Twitter Search API is a dedicated API for running searches against the real-time index of recent Tweets”
• Recent = the last 6-9 days’ worth of tweets
• Anonymous call
• Rate Limit
o Not no. of calls/hour, but complexity & frequency
https://dev.twitter.com/docs/using-search
https://dev.twitter.com/docs/api/1/get/search
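Building a Search call against the endpoint above is mostly URL-encoding, which urlencode takes care of (it produces the %23/%3A style escapes). The query string and the rpp (results per page) parameter here are illustrative.

```python
# Compose a v1 Search API URL; urlencode handles the percent-escaping,
# so @ and # can be written literally in the query.
try:
    from urllib.parse import urlencode   # Python 3
except ImportError:
    from urllib import urlencode         # Python 2, as used in the tutorial

base = "http://search.twitter.com/search.json"
url = base + "?" + urlencode({"q": "#oscon :)", "rpp": 10})
print(url)
```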
Search API
• Filters
o Search is URL-encoded
o @ = %40, # = %23
o Emoticons :) and :(
o http://search.twitter.com/search.atom?q=sometimes+%3A)
o http://search.twitter.com/search.atom?q=sometimes+%3A(
• Location filters, date filters
• Content searches
Streaming API
• Not request-response, but a stream
• Twitter frameworks have the support
• Rate Limit: up to 1%
• Stall warning if the client is falling behind
• Good documentation links
o https://dev.twitter.com/docs/streaming-apis/connecting
o https://dev.twitter.com/docs/streaming-apis/parameters
o https://dev.twitter.com/docs/streaming-apis/processing
Firehose • ~ 400 million public tweets/day • If you are working with Twitter firehose, I envy you !
• If you hit real limits, then explore the firehose route • AFAIK, it is not cheap, but worth it
API Best Practices 1. Use JSON 2. Use user_id than screen_name
o User_id is constant while screen_name can change
3. max_id and since_id o For example direct messages, if you have last message use
since_id for search o max_id how far to go back
4. Cache as much as you can 5. Set the User-Agent header for debugging. I have listed a few good blogs that have API best practices in the reference section, at the end of this presentation
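The since_id/max_id pattern above can be sketched as a loop that walks backwards through pages. This is an illustration with an injected `fetch_page` function standing in for the actual API call (the helper names are hypothetical, not Twitter's):

```python
def fetch_new(fetch_page, since_id):
    """Collect everything newer than `since_id`, paging backwards with max_id.

    `fetch_page(since_id, max_id)` stands in for an API call; it must
    return a list of items (dicts with an "id"), newest first, or [] when done.
    """
    items, max_id = [], None
    while True:
        page = fetch_page(since_id, max_id)
        if not page:
            return items
        items.extend(page)
        # Next page: everything strictly older than the oldest id seen
        max_id = page[-1]["id"] - 1

def fake_page(since_id, max_id):
    """Stand-in data source: ids 10..5 exist, page size is 3."""
    data = [{"id": i} for i in range(10, since_id, -1)]
    if max_id is not None:
        data = [d for d in data if d["id"] <= max_id]
    return data[:3]

new_items = fetch_new(fake_page, since_id=4)  # ids 10 down to 5, two pages
```

since_id bounds the search at the newest item you already have; max_id drives the backwards pagination.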
These are gathered from various books, blogs & other media I used for this tutorial. See Reference (at the end) for the sources
Twitter API
REST Streaming
Twitter REST
Core Data, Core Twitter Objects
Near-realtime, High Volume
Twitter Search
Search & Trends
Keywords Specific User Trends
Build Profile Create/Post Tweets Reply Favorite, Re-tweet
Public Streams User Streams Site Streams
Follow users, topics, data mining
Rate Limit : 150/350 Rate Limit : Complexity & Frequency
Firehose
Questions ?
Part II
SNA
Part II Twitter Network Analysis
1. Collect 3. Transform & Analyze
2. Store
4. Model &
Reason 5. Predict,
Recommend & Visualize
Validate Dataset & re-crawl/refresh
Tip: 1. Implement a
s
a staged p
ipeline,
never a monolit
h�
Tip: 3. Keep t
he
schema simple; don’t
be afraid to
transform�
Most important & the ugliest slide in
this deck !
Trivia • Social Network Analysis originated as Sociometry &
the social network was called a sociogram • Back then, Facebook was called SocioBinder! • Jacob Levi Moreno is considered the originator
o NYTimes, April 3, 1933, P. 17
Twitter Networks - Definitions • Nodes
o Users o #tags
• Edges o Follows o Friends o @mentions o #tags
• Directed
Twitter Networks - Definitions • In-degree
o Followers
• Out-Degree o Friends/Follow
• Centrality Measures • Hubs & Authorities
o Hubs/Directories tell us where Authorities are
o “Of Mortals & Celebrities” is more “Twitter-style”
Twitter Networks - Properties • Concepts From Citation Networks
o Cocitation • Common papers that cite a paper • Common Followers
o C & G (Followed by F & H) o Bibliographic Coupling
• Cite the same papers • Common Friends (i.e. follow same person)
o D, E, F & H follow C o H & F follow C & G
• So H & F have high coupling • Hence, if H follows A, we can
recommend F to follow A
[Figure: example follower network with nodes A-N, illustrating co-citation (common followers) & bibliographic coupling (common friends)]
Twitter Networks - Properties • Bipartite/Affiliation Networks
o Two disjoint subsets o The bipartite concept is very relevant to Twitter social graph o Membership in Lists
• lists vs. users bipartite graph o Common #Tags in Tweets
• #tags vs. members bipartite graph o @mention together
• ? Can this be a bipartite graph • ? How would we fold this ?
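"Folding" a bipartite graph means projecting it onto one of its two node sets. A tiny pure-Python sketch of the lists-vs-users case (the list names and users are made up; networkx's `bipartite.projected_graph` does this at scale):

```python
from itertools import combinations
from collections import Counter

# A bipartite graph as a dict: list name -> set of member users (hypothetical data)
memberships = {
    "cloud":   {"alice", "bob", "carol"},
    "bigdata": {"bob", "carol", "dave"},
}

def fold_onto_users(memberships):
    """One-mode projection: user-user edge weight = number of lists shared."""
    weights = Counter()
    for users in memberships.values():
        for u, v in combinations(sorted(users), 2):
            weights[(u, v)] += 1
    return weights

proj = fold_onto_users(memberships)  # bob & carol share 2 lists, strongest tie
```

The same fold works for #tags vs. users: two users who tweet the same #tags end up with a weighted edge.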
Other Metrics & Mechanisms • Kronecker Graph Models
o Kronecker product is a way of generating self-similar matrices o Prof. Leskovec et al define the Kronecker product of two graphs as the Kronecker product of
their adjacency matrices o Application : Generating models for analysis, prediction, anomaly detection et al
• Erdős-Rényi Random Graphs o Easy to build a Gn,p graph o Assumes equal likelihood of edges between two nodes o In a Twitter social network, we can create a more realistic expected distribution (adding the
“social reality” dimension) by inspecting the #tags & @mentions • Network Diameter • Weak Ties • Follower velocity (+ve & -ve), Association strength
o Unfollow not a reliable measure o But an interesting property to investigate when it happens
Not covered here, but potential for an encore ! Ref: Jure Leskovec: Kronecker Graphs, Random Graphs
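The Kronecker product of adjacency matrices mentioned above is mechanical to compute. A minimal pure-Python sketch (real Kronecker graph work would use numpy's `kron` and a stochastic initiator matrix):

```python
def kron(a, b):
    """Kronecker product of two square adjacency matrices (lists of lists)."""
    n, m = len(a), len(b)
    return [[a[i // m][j // m] * b[i % m][j % m]
             for j in range(n * m)] for i in range(n * m)]

# 2-node "initiator" graph: one self-loop plus an edge
g = [[1, 1],
     [1, 0]]

g2 = kron(g, g)  # 4x4 self-similar adjacency matrix; iterate for larger graphs
```

Repeatedly Kronecker-multiplying the initiator by itself yields the self-similar structure Leskovec et al use for modeling and anomaly detection.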
Twitter Networks - Properties • Twitter != LinkedIn, Twitter != Facebook • Twitter Network == Interest Network • Be cognizant of the above when you apply traditional network
properties to Twitter • For example,
o Six degrees of separation doesn't make sense (most of the time) in Twitter - except maybe for Cliques
o Is diameter a reliable measure for a Twitter Network ? • Probably not
o Do cut sets make sense ? • Probably not
o But citation network principles do apply; we can learn from cliques o Bipartite graphs do make sense
Cliques (1 of 2) • “Maximal subset of the vertices in an
undirected network such that every member of the set is connected by an edge to every other”
• Cohesive subgroup, closely connected • Near-cliques rather than a perfect clique (k-plex, i.e.
each member connected to at least n-k others) • k-plex cliques to discover subgroups in a sparse
network; 1-plex being the perfect clique
Ref: Networks, An Introduction - Newman
Cliques (2 of 2) • k-core - at least k others in the subset; an (n-k)-plex
• k-clique - no more than k distance away o Path inside or outside the subset o k-clan or k-club (path inside the subset)
• We will apply k-plex cliques for one of our hands-on
Ref: Networks, An Introduction - Newman
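For the hands-on we use networkx's `find_cliques`, which enumerates maximal cliques. To show the idea, here is a tiny pure-Python Bron-Kerbosch (the algorithm behind it), on a made-up 4-node graph; it is a sketch, not the tutorial's actual code:

```python
def maximal_cliques(adj):
    """Bron-Kerbosch: yield every maximal clique of an undirected graph.
    `adj` maps each node to the set of its neighbours."""
    def bk(r, p, x):
        if not p and not x:
            yield sorted(r)          # r is maximal: nothing left to add
        for v in list(p):
            yield from bk(r | {v}, p & adj[v], x & adj[v])
            p.remove(v)
            x.add(v)
    yield from bk(set(), set(adj), set())

# Hypothetical graph: triangle a-b-c plus a pendant edge c-d
adj = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
cliques = sorted(maximal_cliques(adj))  # [['a','b','c'], ['c','d']]
```

At @clouderati scale you would reach for `networkx.find_cliques` instead; the output has the same shape.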
Sentiment Analysis • Sentiment Analysis is an important & interesting area of work
on the Twitter platform o Collect Tweets o Opinion Estimation - pass through classifiers, sentiment lexicons
• Naïve Bayes/Max Entropy Classifier/SVM
o Aggregated Text Sentiment/Moving Average
• I chose not to dive deeper because of time constraints o Couldn’t do justice to API, Social Network and Sentiment Analysis,
all in 3 hrs
• The next 3 slides have a couple of interesting examples
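The opinion-lexicon approach from the airline example can be sketched in a few lines. The word lists here are tiny stand-ins (the real lexicon referenced later has ~2,000 positive and ~4,800 negative words):

```python
# Tiny stand-in lexicons for illustration only
POSITIVE = {"great", "love", "awesome", "good"}
NEGATIVE = {"delayed", "lost", "terrible", "bad"}

def score(tweet: str) -> int:
    """Lexicon score: +1 per positive word, -1 per negative word."""
    words = tweet.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

print(score("Love the awesome service"))      # 2
print(score("bags lost and flight delayed"))  # -2
```

A classifier (Naïve Bayes, MaxEnt, SVM) replaces this counting with learned weights, but the collect-score-aggregate pipeline stays the same.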
Sentiment Analysis • Twitter Mining for Airline Sentiment • Opinion Lexicon - +ve 2,000, -ve 4,800 words
http://www.inside-r.org/howto/mining-twitter-airline-consumer-sentiment http://sentiment.christopherpotts.net/lexicons.html#opinionlexicon
Need I say more ?
http://www.economist.com/blogs/schumpeter/2012/06/tracking-social-media?fsrc=scn/gp/wl/bl/moodofthemarket http://www.relevantdata.com/pdfs/IUStudy.pdf
“A bit of clever math can uncover interesting patterns that are not visible to the human eye”
Project Ideas
Interesting Vectors of Exploration 1. Find trending #tags & then related #tags - using
cliques over co-#tag-citation, which infers topics related to trending topics
2. Related #tag topics over a set of tweets by a user or group of users
3. Analysis - In/Out flow, Tweet Flow - Frequent @mention
4. Find affiliation networks by List memberships, #tags or frequent @mentions
Interesting Vectors of Exploration 5. Use centrality measures to determine mortals vs.
celebrities 6. Classify Tweet networks/cliques based on message
passing characteristics - Tweets vs. Retweets, No. of retweets,…
7. Retweet Network – Measure Influence by retweet count & frequency – Information contagion by looking at different retweet
network subcomponents – who, when, how much,…
Twitter Network Graph Analysis
An Example
Analysis Story Board • @clouderati is a popular cloud related Twitter account
• Goals: o Analyze the social graph characteristics of the users who are
following the account • Dig one level deep, to the followers & friends, of the
followers of @clouderati o How many cliques ? How strong are they ? o Does the @mention support the clique inferences ? o What are the retweet characteristics ? o How does the #tag network graph look like ?
In this tutorial
For you to explore !!
Twitter Analysis Pipeline Story Board Stages, Strategies, APIs & Tasks
Stage 3
o Get distinct user list applying the set(union(list)) operation
Stage 4
o Get & Store User details (distinct user list)
o Unroll
Stage 5 o For each @clouderati
follower o Find friend=follower - set
intersection
Stage 6
o Create social graph
o Apply network theory
o Infer cliques & other properties
Note: Needed a command buffer to manage scale (~980,000 users)
Note: Unroll stage took time & missteps
@clouderati Twitter Social Graph • Stats (Retrospect after the runs):
o Stage 1 • @clouderati has 2072 followers
o Stage 2 • Limiting followers to 5,000 per user
o Stage 3 • Digging 1st level (set union of followers & friends of the
followers of @clouderati) explodes into ~980,000 distinct users
o MongoDB cache and intermediate datasets ~10 GB o The database was hosted at AWS (High-Memory Extra Large - m2.xlarge), 8
x 15 GB, RAID 10, opened to the Internet with DB authentication
Code & Run Walk Through o Code:
§ oscon_2012_user_list_spider_01.py
o Challenges: § Nothing fancy § Get the record and store § Would have had to recurse through a REST
cursor if there were more than 5000 followers § @clouderati has 2072 followers
o Interesting Points:
Stage 1
o Get @clouderati Followers o Store in MongoDB
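Stage 1 boils down to walking Twitter's cursor until it returns 0. A sketch with an injected `fetch_page` standing in for the historical GET followers/ids call (the fake two-page data below is for illustration; storing to MongoDB is shown only as a comment):

```python
def crawl_followers(fetch_page):
    """Walk the followers/ids cursor until the API returns next_cursor == 0.

    `fetch_page(cursor)` stands in for
    GET https://api.twitter.com/1/followers/ids.json?cursor=...
    and must return a dict with "ids" and "next_cursor".
    """
    ids, cursor = [], -1          # -1 asks for the first page
    while cursor != 0:
        page = fetch_page(cursor)
        ids.extend(page["ids"])
        cursor = page["next_cursor"]
    return ids

# Fake two-page response for illustration
pages = {-1: {"ids": [1, 2, 3], "next_cursor": 99},
         99: {"ids": [4, 5], "next_cursor": 0}}
followers = crawl_followers(lambda c: pages[c])
# With pymongo, e.g.: db.t_followers.insert({"ids": followers})
```

@clouderati's 2072 followers fit in a single page, so the loop runs once; the cursor logic matters past 5,000 ids.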
Code & Run Walk Through o Code:
§ oscon_2012_user_list_spider_02.py § oscon_2012_twitter_utils.py § oscon_2012_mongo.py § oscon_2012_validate_dataset.py
o Challenges: § Multiple runs, errors et al !
o Interesting Points: § Set operation between two mongo collections for restart buffer § Protected users, some had 0 followers, or 0 friends § Interesting operations for validate, re-‐crawl and refresh § Added “status_code” to differentiate protected users
§ {'$set': {'status_code': '401 Unauthorized,401 Unauthorized'}} § Getting friends & followers of 2000 users is the hardest (or so I thought,
until I got through the next stage!)
Stage 2
o Crawl 1 level deep o Get friends & followers o Validate, re-‐crawl & refresh
Validate-Recrawl-Refresh Logs • pymongo version = 2.2 • Connected to DB! • … • 2075 • Error Friends : <type 'exceptions.KeyError'> • 4ff3cd40e5557c00c7000000 - none has 2072 followers & 0 friends • Error Friends : <type 'exceptions.KeyError'> • 4ff3a958e5557cfc58000000 - none has 2072 followers & 0 friends • Error Friends : <type 'exceptions.KeyError'> • 4ff3ccdee5557c00b6000000 - none has 2072 followers & 0 friends • 4ff3d3b9e5557c01b900001e - 371187804 has 0 followers & 0 friends • 4ff3d3d8e5557c01b9000048 - 63488295 has 155 followers & 0 friends • 4ff3d3d9e5557c01b9000049 - 342712617 has 0 followers & 0 friends • 4ff3d3d9e5557c01b900004a - 21266738 has 0 followers & 0 friends • 4ff3d3dae5557c01b900004b - 204652853 has 0 followers & 0 friends • … • 4ff475cfe5557c1657000074 - 258944989 has 0 followers & 0 friends • 4ff475d3e5557c165700007d - 327286780 has 0 followers & 0 friends • Looks like we have 132 not so good records • Elapsed Time = 0.546846
o 1st run - 132 bad records o This is the classic Erlang-style
supervisor o The crawl continues on transport errors
without worrying about retry o Validate will recrawl & refresh as
needed
Code & Run Walk Through o Code:
§ oscon2012_analytics_01.py
o Challenges: o Figure out the right Set operations
o Interesting Points: § 973,323 unique users ! § Recursively apply set union over 400,00 lists § Set operations took slightly more than a minute
Stage 3
o Get distinct user list applying the set(union(list)) operation
Code & Run Walk Through o Code:
§ oscon2012_analytics_01.py (focus on cmd string creation) § oscon2012_get_user_info_01.py § oscon2012_unroll_user_list_01.py § oscon2012_unroll_user_list_02.py
o Challenges: § Where do I start ?
• In the next few slides § Took me a few days to get it right (along with my daily job!) § Unfortunately I did not employ parallelism & didn’t use my
MacPro with 32 GB memory. So the runs were long § But learned hard lessons on check point & restart
o Interesting Points: § Tracking Control Numbers § Time … Marathon unroll run 19:33:33 !
Stage 4
o Get & Store User details (distinct user list)
o Unroll
Twitter @ scale Pattern • Challenge:
o You want to get screen names, follower counts and other details for a million users
• Problem: o No easy REST API o https://api.twitter.com/1/users/lookup.json will take 100 user_ids and give
details
• Solution: o This is a scalability challenge. Approach it like so o Create a command buffer collection in MongoDB splitting the million user_ids
into batches of 100 o Have a “done” flag initialized to 0 for checkpoint & restart o After each cmd str is executed, set “done”:1 o For subsequent runs, ignore “done”:1 o Also helps in control number tracking
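The command-buffer pattern above can be sketched as follows. The batching function is real, runnable Python; the pymongo calls and the `api_str`/`seq_no`/`done` field names follow the slides, but the helper name is mine:

```python
def build_cmd_buffer(user_ids, batch=100):
    """Split user_ids into lookup batches, each with a 'done' flag for restart."""
    docs = []
    for seq_no, i in enumerate(range(0, len(user_ids), batch)):
        chunk = user_ids[i:i + batch]
        docs.append({"seq_no": seq_no,
                     "api_str": ",".join(str(u) for u in chunk),
                     "done": 0})
    return docs

docs = build_cmd_buffer(list(range(250)))  # 3 batches: 100 + 100 + 50
# With pymongo: db.api_str.insert(docs); after each successful lookup call,
#   db.api_str.update({"seq_no": d["seq_no"]}, {"$set": {"done": 1}})
# Subsequent runs only fetch {"done": 0} -- checkpoint & restart for free.
```

The `done` flag is what makes the multi-day unroll run survivable: a crash costs at most one batch.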
Control numbers
Control Numbers • > db.t_users_info.count() • 8122 • > db.api_str.count({"done":0,"seq_no":{"$lt":8185}},{"seq_no”:) • 63 • > db.api_str.find({"done":0,"seq_no":{"$lt":8185}},{"seq_no":1}) • { "_id" : ObjectId("4ff4daeae5557c28bf001d53"), "seq_no" : 5433 } • { "_id" : ObjectId("4ff4daeae5557c28bf001d59"), "seq_no" : 5439 } • { "_id" : ObjectId("4ff4daeae5557c28bf001d5f"), "seq_no" : 5445 } • { "_id" : ObjectId("4ff4daebe5557c28bf001d74"), "seq_no" : 5466 } • { "_id" : ObjectId("4ff4daece5557c28bf001d7a"), "seq_no" : 5472 } • { "_id" : ObjectId("4ff4daece5557c28bf001d80"), "seq_no" : 5478 } • { "_id" : ObjectId("4ff4daede5557c28bf001d90"), "seq_no" : 5494 } • { "_id" : ObjectId("4ff4daefe5557c28bf001daf"), "seq_no" : 5525 } • { "_id" : ObjectId("4ff4daf0e5557c28bf001dba"), "seq_no" : 5536 } • { "_id" : ObjectId("4ff4daf1e5557c28bf001dcf"), "seq_no" : 5557 } • { "_id" : ObjectId("4ff4daf2e5557c28bf001de9"), "seq_no" : 5583 } • { "_id" : ObjectId("4ff4daf2e5557c28bf001def"), "seq_no" : 5589 } • { "_id" : ObjectId("4ff4daf4e5557c28bf001e0e"), "seq_no" : 5620 } • { "_id" : ObjectId("4ff4daf4e5557c28bf001e14"), "seq_no" : 5626 } • { "_id" : ObjectId("4ff4daf6e5557c28bf001e2e"), "seq_no" : 5652 } • { "_id" : ObjectId("4ff4daf6e5557c28bf001e39"), "seq_no" : 5663 } • { "_id" : ObjectId("4ff4daf8e5557c28bf001e62"), "seq_no" : 5704 } • { "_id" : ObjectId("4ff4dafae5557c28bf001e77"), "seq_no" : 5725 } • { "_id" : ObjectId("4ff4dafae5557c28bf001e81"), "seq_no" : 5735 } • { "_id" : ObjectId("4ff4dawe5557c28bf001e9b"), "seq_no" : 5761 } • Type "it" for more • > it • { "_id" : ObjectId("4ff4dafce5557c28bf001ea6"), "seq_no" : 5772 } • { "_id" : ObjectId("4ff4dafce5557c28bf001eac"), "seq_no" : 5778 } • { "_id" : ObjectId("4ff4dafde5557c28bf001eb7"), "seq_no" : 5789 } • { "_id" : ObjectId("4ff4dafde5557c28bf001ebd"), "seq_no" : 5795 } • { "_id" : ObjectId("4ff4dafee5557c28bf001ec8"), "seq_no" : 5806 } • { "_id" : ObjectId("4ff4daffe5557c28bf001ed8"), "seq_no" : 5822 } • { "_id" : 
ObjectId("4ff4db00e5557c28bf001eed"), "seq_no" : 5843 } • { "_id" : ObjectId("4ff4db00e5557c28bf001ef3"), "seq_no" : 5849 } • { "_id" : ObjectId("4ff4db01e5557c28bf001efe"), "seq_no" : 5860 } • { "_id" : ObjectId("4ff4db01e5557c28bf001f09"), "seq_no" : 5871 } • { "_id" : ObjectId("4ff4db03e5557c28bf001f23"), "seq_no" : 5897 } • { "_id" : ObjectId("4ff4db05e5557c28bf001f47"), "seq_no" : 5933 } • { "_id" : ObjectId("4ff4db05e5557c28bf001f52"), "seq_no" : 5944 } • { "_id" : ObjectId("4ff4db06e5557c28bf001f58"), "seq_no" : 5950 } • { "_id" : ObjectId("4ff4db06e5557c28bf001f5e"), "seq_no" : 5956 } • { "_id" : ObjectId("4ff4db06e5557c28bf001f69"), "seq_no" : 5967 } • { "_id" : ObjectId("4ff4db07e5557c28bf001f74"), "seq_no" : 5978 } • { "_id" : ObjectId("4ff4db07e5557c28bf001f7f"), "seq_no" : 5989 } • { "_id" : ObjectId("4ff4db0ae5557c28bf001fa8"), "seq_no" : 6030 } • { "_id" : ObjectId("4ff4db0ae5557c28bf001fae"), "seq_no" : 6036 } • Type "it" for more • > it • { "_id" : ObjectId("4ff4db0ae5557c28bf001w9"), "seq_no" : 6047 } • { "_id" : ObjectId("4ff4db0be5557c28bf001fc4"), "seq_no" : 6058 } • { "_id" : ObjectId("4ff4db0be5557c28bf001fca"), "seq_no" : 6064 } • { "_id" : ObjectId("4ff4db0de5557c28bf001fe0"), "seq_no" : 6086 } • { "_id" : ObjectId("4ff4db0de5557c28bf001fe6"), "seq_no" : 6092 } • { "_id" : ObjectId("4ff4db0de5557c28bf001fec"), "seq_no" : 6098 } • { "_id" : ObjectId("4ff4db0ee5557c28bf002006"), "seq_no" : 6124 } • { "_id" : ObjectId("4ff4db10e5557c28bf002025"), "seq_no" : 6155 } • { "_id" : ObjectId("4ff4db12e5557c28bf002044"), "seq_no" : 6186 } • { "_id" : ObjectId("4ff4db12e5557c28bf00204a"), "seq_no" : 6192 } • { "_id" : ObjectId("4ff4db1ae5557c28bf0020e0"), "seq_no" : 6342 } • { "_id" : ObjectId("4ff4db1ae5557c28bf0020e1"), "seq_no" : 6343 } • { "_id" : ObjectId("4ff4db2ee5557c28bf002240"), "seq_no" : 6694 } • { "_id" : ObjectId("4ff4db34e5557c28bf0022b9"), "seq_no" : 6815 } • { "_id" : ObjectId("4ff4db41e5557c28bf00239f"), "seq_no" : 7045 } • { "_id" : 
ObjectId("4ff4db53e5557c28bf0024fe"), "seq_no" : 7396 } • { "_id" : ObjectId("4ff4db66e5557c28bf00265d"), "seq_no" : 7747 } • { "_id" : ObjectId("4ff4db68e5557c28bf002678"), "seq_no" : 7774 } • { "_id" : ObjectId("4ff4db6be5557c28bf0026af"), "seq_no" : 7829 } • >
The collection should have 8185 documents But it has only 8122. Where did the rest go ?
63 of them still have done=0 8122 + 63 = 8185 ! Aha, mystery solved. They fell through the cracks Need a catch-all final run
Day in the life of a Control Number Detective - Run #1 • Remember : 973,323 users. So, 9734 cmd strings (100 users per string) • > db.api_str.count() • 9831 • > db.api_str.count({"done":0}) • 239
• > db.t_users_info.count() • 9592 • > db.api_str.count({"api_str":""}) • 97 • So we should have 9831 - 97 = 9734 records • The second run should generate 9734-9592 = 142 calls (i.e. 350-142=208 rate-limit should remain). Let us see. • { • …
• "x-‐ratelimit-‐class": "api_identified", • "x-‐ratelimit-‐limit": "350", • "x-‐ratelimit-‐remaining": "209", • … • } • Yep, 209 left • >
Day in the life of a Control Number Detective - Run #2 • Remember : 973,323 users. So, 9734 cmd strings (100 users per string) • > db.t_users_info.count() • 9728 • > db.api_str.count({"api_str":""}) • 97
• > db.api_str.count({"done":0}) • 103 • > 9734-9728=6, same as 103-97 ! • Run once more ! • > db.api_str.find({"done":0},{"seq_no":1}) • … • { "_id" : ObjectId("4ff4dbd4e5557c28bf002e22"), "seq_no" : 9736 } • { "_id" : ObjectId("4ff4db05e5557c28bf001f47"), "seq_no" : 5933 }
• { "_id" : ObjectId("4ff4db8be5557c28bf0028f6"), "seq_no" : 8412 } • { "_id" : ObjectId("4ff4dba2e5557c28bf002a8c"), "seq_no" : 8818 } • { "_id" : ObjectId("4ff4dbaee5557c28bf002b69"), "seq_no" : 9039 } • { "_id" : ObjectId("4ff4dbb8e5557c28bf002c1c"), "seq_no" : 9218 } • …
• { • … • "x-ratelimit-limit": "350", • "x-ratelimit-remaining": "344", • …
• } • Yep, 6 more records • > db.t_users_info.count() • 9734
• Good, got 9734 !
Professor Layton would be proud !
In fact, I have all four & plan to spend some time with them & Laphroaig !
Monitor runs & track control numbers
Unroll run 8:48 PM to ~4:08 PM next day !
Track error & the document numbers
Code & Run Walk Through o Code:
§ oscon2012_find_strong_ties_01.py § oscon2012_social_graph_stats_01.py
o Challenges: § None. Python set operations made this easy
o Interesting Points: § Even at this scale, single machine is not enough § Should have tried data parallelism
• This task is well suited to leverage data parallelism as it is commutative & associative
• Was getting invalid cursor error from MongoDB • So had to do the updates in two steps
Stage 5
o For each @clouderati follower
o Find friend=follower - set intersection
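The friend=follower computation is a plain set intersection, which is why Stage 5 had no real challenges. A minimal sketch with made-up screen names:

```python
# Hypothetical follower/friend lists for one user
followers = {"alice", "bob", "carol", "dave"}
friends   = {"bob", "carol", "erin"}

# Strong ties: users on both lists (they follow each other)
strong_ties = followers & friends             # {'bob', 'carol'}

# Fraction of followers who are mutual -- the "Strong Ties" % used later
strength = len(strong_ties) / len(followers)  # 0.5
```

Python's set operations are C-backed, which is what made this feasible even at 235,697 users; the commutative/associative nature of intersection is also what makes the step data-parallel.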
Code & Run Walk Through o Code:
§ oscon2012_find_cliques_01.py
o Challenges: o Lots of good information hidden in
the data ! o Memory !
o Interesting Points: o Graph, List & set operations o networkx has lots of interesting
graph algorithms o Collections.Counter to the rescue
Stage 6
o Create social graph o Apply network theory o Infer cliques & other
properties
Twitter Social Graph Analysis of @clouderati
o 2072 Followers; 973,323 unique users one level down w/ followers/friends trimmed at 5,000
o Strong ties o follower=friend
o 235,697 users, 462,419 edges o 501,367 Cliques o 8,906 cliques w/ > 10 users, spanning 253 unique users
o GeorgeReese in 7,973 of them ! See list for the 1st 125
o krishnan 3,446, randy 2,197, joe 1,977, sam 1,937, jp 485, stu 403, urquhart 263, beaker 226, acroll 149, adrian 63, gevaperry 24
o Of course, clique analysis does not tell us the whole story …
Clique Distribution = {2: 296521, 3: 58368, 4: 36421, 5: 28788, 6: 24197, 7: 20240, 8: 15997, 9: 11929, 10: 6576, 11: 1909, 12: 364, 13: 55, 14: 2}
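A distribution like the one above falls out of `collections.Counter` (the "Counter to the rescue" note in Stage 6). A toy sketch over hypothetical cliques:

```python
from collections import Counter

# Hypothetical clique lists -- the real run produced 501,367 of them
cliques = [["a", "b"], ["a", "b", "c"], ["c", "d"], ["a", "c", "d"]]

# Map clique size -> number of cliques of that size
distribution = Counter(len(c) for c in cliques)
# At full scale this yields {2: 296521, 3: 58368, ...}; here: {2: 2, 3: 2}
```

Counting user appearances across cliques (GeorgeReese in 7,973 of them) is the same one-liner with user names instead of sizes.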
Twitter Social Graph Analysis of @clouderati
o sort by followers vs. sort by strong ties is interesting
Celebrity – very low strong ties
Medium Celebrity, medium strong ties
Higher Celebrity, low strong ties
Twitter Social Graph Analysis of @clouderati o A higher “Strong Ties”
number is interesting § It means a very high
follower-‐friend intersection
§ Reeves 62%, bgolden 85%
o But a high clique count with a smaller “Strong Ties” number shows a more cohesive & stronger social graph § e.g. Krishnan - 15%
friends-followers § Samj - 33%
Twitter Social Graph Analysis of @clouderati
o Ideas for more Exploration § Include all
followers (instead of stopping at the 5000 cap)
§ Get tweets & track @mention
§ Frequent @mention shows stronger ties
§ #tag analysis could show some interesting networks
Twitter Tips – A Baker’s Dozen 1. Twitter APIs are (more or less) congruent & symmetric 2. Twitter is usually right & simple -‐ recheck when you get unexpected results
before blaming Twitter o I was getting numbers when I was expecting screen_names in user objects. o Was ready to send blasting e-‐mails to Twitter team. Decided to check one more time
and found that my parameter key was wrong - screen_name instead of user_id o Always test with one or two records before a long run ! - learned the hard way
3. Twitter APIs are very powerful - consistent use can yield huge amounts of data o In a week, you can pull in 4-5 million users & some tweets ! o Night runs are far faster & more error-free
4. Use a NOSQL data store as a command buffer & data buffer o Would make it easy to work with Twitter at scale o I use MongoDB o Keep the schema simple & no fancy transformation
• And as far as possible same as the (json) response o Use NOSQL CLI for trimming records et al
The Beginning As The End
Twitter Tips – A Baker’s Dozen 5. Always use a big data pipeline
o Collect - Store - Transform & Analyze - Model & Reason - Predict, Recommend & Visualize o That way you can orthogonally extend, with functional components like command buffers,
validation et al 6. Use functional approach for a scalable pipeline
o Compose your data big pipeline with well defined granular functions, each doing only one thing o Don’t overload the functional components (i.e. no collect, unroll & store as a single component) o Have well defined functional components with appropriate caching, buffering, checkpoints &
restart techniques • This did create some trouble for me, as we will see later
7. Crawl-‐Store-‐Validate-‐Recrawl-‐Refresh cycle o The equivalent of the traditional ETL o Validation stage & validation routines are important
• Cannot expect perfect runs • Cannot manually look at data either, when data is at scale
8. Have control numbers to validate runs & monitor them o I still remember control numbers which start with the number of punch cards in the input deck & then follow that
number through the various runs ! o There will be a separate printout of the control numbers that will be kept in the operations files
Twitter Tips – A Baker’s Dozen 9. Program defensively
o more so for REST-based big data analytics systems o Expect failures at the transport layer & accommodate for them
10. Have Erlang-‐style supervisors in your pipeline o Fail fast & move on o Don’t linger and try to fix errors that cannot be controlled at that layer o A higher layer process will circle back and do incremental runs to
correct missing spiders and crawls o Be aware of visibility & lack of context. Validate at the lowest layer that
has enough context to take corrective actions o I have an example in part 2
11. Data will never be perfect o Know your data & accommodate for its idiosyncrasies
• for example: 0 followers, protected users, 0 friends,…
Twitter Tips – A Baker’s Dozen 12. Check Point frequently (preferably after every API call) & have a
re-‐startable command buffer cache o See a MongoDB example in Part 2
13. Don’t bombard the URL o Wait a few seconds before successful calls. This will end up with a
scalable system, eventually o I found 10 seconds to be the sweet spot. 5 seconds gave retry error. Was able to
work with 5 seconds with wait & retry. Then, the rate limit started kicking in ! 14. Always measure the elapsed time of your API runs & processing
o Kind of early warning when something is wrong
15. Develop incrementally; don’t fail to check “cut & paste” errors
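Tips 9, 12 & 13 combine into a small wait-and-retry wrapper. A sketch with an injectable `sleep` so the pause is testable (the 10-second default matches the sweet spot found above; the function names are mine):

```python
import time

def call_with_retry(call, retries=3, wait=10, sleep=None):
    """Retry a flaky API call, pausing `wait` seconds between attempts.
    `sleep` is injectable for testing; defaults to time.sleep."""
    sleep = sleep or time.sleep
    for attempt in range(retries):
        try:
            return call()
        except IOError:
            if attempt == retries - 1:
                raise            # fail fast -- let the supervisor layer recrawl
            sleep(wait)

# Demo: a call that fails once, then succeeds; record pauses instead of sleeping
pauses = []
state = {"calls": 0}
def flaky():
    state["calls"] += 1
    if state["calls"] == 1:
        raise IOError("transport hiccup")
    return "ok"

result = call_with_retry(flaky, sleep=pauses.append)
```

On final failure the exception propagates, which is exactly the Erlang-style supervisor behaviour of tip 10: fail fast and let the validate/recrawl pass clean up.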
Twitter Tips – A Baker’s Dozen 16. The Twitter big data pipeline has lots of opportunities for parallelism
o Leverage data parallelism frameworks like MapReduce o But first :
§ Prototype as a linear system, § Optimize and tweak the functional modules & cache strategies, § Note down stages and tasks that can be parallelized and § Then parallelize them
o For the example project, we will see later, I did not leverage any parallel frameworks, but the opportunities were clearly evident. I will point them out, as we progress through the tutorial
17. Pay attention to handoffs between stages o They might require transformation – for example collect & store might store a user list
as multiple arrays, while the model requires each user to be a document for aggregation
o But resist the urge to overload collect with transform o i.e. let the collect stage store in arrays, but then have an unroll/flatten stage to transform
the array to separate documents o Add transformation as a granular function – of course, with appropriate buffering, caching,
checkpoints & restart techniques 18. Have a good log management system to capture and wade through
logs
Twitter Tips – A Baker’s Dozen 19. Understand the underlying network characteristics for the
inference you want to make o Twitter Network != Facebook Network , Twitter Graph != LinkedIn Graph
o Twitter Network is more of an Interest Network o So, many of the traditional network mechanisms & mechanics, like network
diameter & degrees of separation, might not make sense o But, others like Cliques and Bipartite Graphs do
Twitter Gripes 1. Need more rich APIs for #tags
o Somewhat similar to users viz. followers, friends et al o Might make sense to make #tags a top level object with its own semantics
2. HTTP Error Return is not uniform o Returns 400 bad Request instead of 420 o Granted, there is enough information to figure this out
3. Need an easier way to get screen_name from user_id 4. “following” vs. “friends_count” i.e. “following” is a dummy variable.
o There are a few like this, most probably for backward compatibility 5. Parameter Validation is not uniform
o Gives “404 Not found” instead of “406 Not Acceptable” or “416 Range Unacceptable”
6. Overall more validation would help o Granted, it is more of growing pains. Once one comes across a few inconsistencies, the
rest is easy to figure out
Thanks To these Giants …
I had a good time researching &
preparing for this Tutorial.
I hope you learned a few new things &
have a few vectors to follow