BIG Data: Crawling Large-Scale and Real-Time Tweets With MySQL Database
2013 Open Seminar Series 6
Open Geospatial Informatics
Cheng-Ying Liu (Sean) [email protected]
http://bermuda.citi.sinica.edu.tw
BIG Data & Twitter
WHAT IS BIG DATA ?
In information technology, big data is a loosely-defined term used to describe data sets so large and complex that they become awkward to work with using on-hand database management tools.
《Wikipedia Big data》
Source: http://en.wikipedia.org/wiki/Big_data
WHAT IS BIG DATA ?
• In 2001, Doug Laney used the 3V model to describe Big Data
  ‒ Volume: amount of data
  ‒ Velocity: speed of data in and out
  ‒ Variety: range of data types and sources
• A fourth V was added later by others
  ‒ Veracity: truth or fact of data
WHAT IS BIG DATA ?
• In 2012, Gartner updated the definition
  – Still advocates the 3V model for describing data
  – Requires new forms of processing
  – Enhanced decision making
  – Insight discovery
  – Process optimization
HOW BIG IS BIG DATA ?
• Beyond the ability of commonly used software tools
• A few dozen terabytes (107) to many petabytes (108)
  − 2008: Google processes 20 PB a day
  − 2009: Facebook has 2.5 PB user data + 15 TB/day
  − 2009: eBay has 6.5 PB user data + 50 TB/day
  − 2011: Yahoo! has 180-200 PB of data
  − 2012: Facebook ingests 500 TB/day
NEW TECHNOLOGY FOR BIG DATA
• Hadoop
  – Developed by the Apache Software Foundation
  – Derived from Google's MapReduce and Google File System
  – Able to process petabyte-scale data
• NoSQL (Not Only SQL)
  – Relational databases are not suitable for all cases
  – NoSQL is a new choice: non-relational databases
  – Adopted by Google, Facebook, Twitter, etc.
WHAT IS TWITTER?
• The fastest, simplest way to communicate
• More than 140M active users
• The majority of usage comes from mobile
• 60% of users are outside the U.S.
• More than 400M twitter.com visitors
• More than 400M tweets/day (peak: 25K/sec)
• 1,000 employees (majority in San Francisco)
• 50% of employees are engineers
• Expected to reach nearly $1 billion in global ad revenue in 2014 (eMarketer)
TWITTER HISTORY
• Evan Williams on the genesis of Twitter, ICWSM, April 2007:
  − A side project started from Jack Dorsey’s idea Oct, 2006
  − Wanted a ubiquitous status message
  − A community of people answering the question “what are you doing?”
  − Exploded at SXSW, SF earthquakes (2011)
  − Good for collective “backchanneling”
  − High “Ambient intimacy”
  − Huge API usage was unexpected, as was the rise of the @ sign for replies
HOW BIG IS TWITTER ?
Source: http://blog.twitter.com/2011/06/200-million-tweets-per-day.html
WHAT IS A TWEET ?
TWITTER TOWN HALL
July 6, 2011
Mapping the global Twitter heartbeat: The geography of Twitter, May 2013
Source: http://www.sgi.com/go/twitter/images/hires/figure4.png
TWITTER STATS
Source: Pew Research Center's Internet & American Life Project Winter 2012 Tracking Survey, January 20-February 19, 2012. N=2,253 adults age 18 and older, including 901 cell phone interviews. Interviews conducted in English and Spanish. The margin of error is +/-2.7 percentage points for internet users. **Represents significant difference compared with all other rows in group.
Twitter Dev
TWITTER ACCOUNT
• Register a Twitter account (required)
REGISTER A TWITTER APPLICATION
• Twitter developer web site: https://dev.twitter.com/
• Select “My applications”
• Click “Create a new application” in the application list
• Fill in the required information
• Agree to the developer rules and fill in the CAPTCHA
• Go back to the application list and click your application
• Click “Settings”
• Select “Read, Write and Access direct messages”
• Click “Update this Twitter application’s settings”
• Click “Create my access token”
Twitter API Resource
REST API
Source: https://dev.twitter.com/docs/streaming-apis
STREAMING API
Source: https://dev.twitter.com/docs/streaming-apis
TWEET CRAWL API
Source: https://dev.twitter.com/docs/api/1.1
Source: https://dev.twitter.com/docs/rate-limiting/1.1/limits
Resource | Description | Request Limit (Per User) | Request Limit (Via OAuth)
GET statuses/show/:id | Returns a single Tweet, specified by the id parameter. | 180 / 15 mins | 180 / 15 mins
POST statuses/update | Updates the authenticating user's current status, also known as tweeting. | - | -
GET search/tweets | Returns a collection of relevant Tweets matching a specified query. | 180 / 15 mins | 450 / 15 mins
POST statuses/filter | Returns public statuses that match one or more filter predicates. | - | -
GET statuses/firehose | This endpoint requires special permission to access. Returns all public statuses. | - | -
tmhOAuth LIBRARY
• Website: https://github.com/themattharris/tmhOAuth
• $ git clone https://github.com/themattharris/tmhOAuth.git
• Current Version 0.8.2
• Author: Matt Harris @themattharris
• Goal:
  ‒ Support OAuth 1.0A
  ‒ Use authorization headers instead of query string or POST parameters
  ‒ Allow uploading of images
  ‒ Provide enough information to assist with debugging
CRAWLING WITH REST API
• Create an OAuth object that contains the authentication tokens
• Set the parameters for the API
• Use the Twitter REST API to obtain tweets
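The slide's original code (PHP with tmhOAuth) did not survive extraction. As a rough illustration of the "set parameters" step only, the following Python sketch builds the URL for one GET search/tweets request. The helper name build_search_url is hypothetical, the endpoint is the v1.1 search URL as documented at the time, and OAuth signing and the network call are deliberately left out.

```python
from urllib.parse import urlencode

# v1.1 search endpoint as documented at the time of the slides.
SEARCH_URL = "https://api.twitter.com/1.1/search/tweets.json"


def build_search_url(query, count=100, lang=None, since_id=None):
    """Return the GET URL for one search request (hypothetical helper, no network access)."""
    # The API returns at most 100 tweets per page, so cap the requested count.
    params = {"q": query, "count": min(count, 100)}
    if lang is not None:
        params["lang"] = lang
    if since_id is not None:
        params["since_id"] = since_id
    return SEARCH_URL + "?" + urlencode(params)


url = build_search_url("obama OR news", lang="en")
print(url)
```

A real crawler would still have to sign this request with the application's OAuth tokens before sending it.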
CRAWLING WITH STREAMING API
• Create an OAuth object that contains the authentication tokens
• Set the parameters for the API
• Open a connection to the Twitter server
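Once the connection is open, the streaming endpoint delivers one JSON message per line, with blank lines sent as keep-alives. The sketch below shows one way the receiving loop might separate tweets from other messages. It is not the author's code: iter_tweets is a hypothetical name, and the sample lines stand in for data read from the socket.

```python
import json


def iter_tweets(lines):
    """Yield only tweet messages from an iterable of raw stream lines."""
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # blank keep-alive line
        message = json.loads(line)
        # Tweets carry both "id" and "text"; notices such as "limit" do not.
        if "id" in message and "text" in message:
            yield message


# Stand-in for lines read from the open connection.
sample = [
    '{"id": 1, "text": "hello"}\r\n',
    "\r\n",
    '{"limit": {"track": 5}}\r\n',
]
tweets = list(iter_tweets(sample))
print(len(tweets))
```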
WHAT IS OAuth ?
• OAuth = Open Authorization
• What is OAuth:
  ‒ An open protocol to allow secure API authorization in a simple and standard method from desktop and web applications.
• Endpoints used in the OAuth flow:
  ‒ Request token URL
  ‒ Authorize URL
  ‒ Access token URL
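Libraries such as tmhOAuth hide the signing step, but it helps to see what "authorization" means on the wire. The following is a minimal sketch of an OAuth 1.0a HMAC-SHA1 signature, assuming the usual base-string construction (method, URL, and sorted parameters, each percent-encoded). The function names and all credential values are made up for illustration.

```python
import base64
import hashlib
import hmac
from urllib.parse import quote


def pct(value):
    """Percent-encode using only the unreserved characters OAuth 1.0a allows."""
    return quote(str(value), safe="-._~")


def sign_request(method, url, params, consumer_secret, token_secret=""):
    """Return the base64 HMAC-SHA1 signature for one request (illustrative sketch)."""
    # 1. Encode and sort all request and oauth_* parameters.
    pairs = sorted((pct(k), pct(v)) for k, v in params.items())
    param_string = "&".join(f"{k}={v}" for k, v in pairs)
    # 2. Signature base string: METHOD & encoded URL & encoded parameter string.
    base_string = "&".join([method.upper(), pct(url), pct(param_string)])
    # 3. Signing key: consumer secret and token secret joined by "&".
    key = f"{pct(consumer_secret)}&{pct(token_secret)}"
    digest = hmac.new(key.encode(), base_string.encode(), hashlib.sha1).digest()
    return base64.b64encode(digest).decode()


# All values below are placeholders, not real credentials.
demo_params = {
    "q": "happy hour",
    "oauth_consumer_key": "demo-key",
    "oauth_nonce": "abc123",
    "oauth_signature_method": "HMAC-SHA1",
    "oauth_timestamp": "1375000000",
    "oauth_token": "demo-token",
    "oauth_version": "1.0",
}
signature = sign_request(
    "GET",
    "https://api.twitter.com/1.1/search/tweets.json",
    demo_params,
    consumer_secret="demo-consumer-secret",
    token_secret="demo-token-secret",
)
print(signature)
```

Because the timestamp is part of the signed data, a badly synchronized clock makes otherwise valid requests fail.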
NORMAL SEARCH OPERATORS
Operator | Finds tweets...
twitter search | containing both "twitter" and "search". This is the default operator.
"happy hour" | containing the exact phrase "happy hour".
love OR hate | containing either "love" or "hate" (or both).
beer -root | containing "beer" but not "root".
#haiku | containing the hashtag "haiku".
from:alexiskold | sent from person "alexiskold".
to:techcrunch | sent to person "techcrunch".
@mashable | referencing person "mashable".
"happy hour" near:"san francisco" | containing the exact phrase "happy hour" and sent near "san francisco".
near:NYC within:15mi | sent within 15 miles of "NYC".
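These operators are plain text inside the q parameter, so a crawler can assemble them with ordinary string handling. A small sketch follows; build_query and its argument names are hypothetical, and only a subset of the operators is covered.

```python
def build_query(all_of=(), phrase=None, any_of=(), none_of=(),
                hashtag=None, from_user=None):
    """Assemble a search string from a few of the operators (hypothetical helper)."""
    parts = list(all_of)                      # default operator: all words
    if phrase:
        parts.append(f'"{phrase}"')           # exact phrase
    if any_of:
        parts.append(" OR ".join(any_of))     # either word
    parts.extend(f"-{word}" for word in none_of)  # excluded words
    if hashtag:
        parts.append(f"#{hashtag}")
    if from_user:
        parts.append(f"from:{from_user}")
    return " ".join(parts)


print(build_query(all_of=["beer"], none_of=["root"]))
print(build_query(phrase="happy hour", from_user="alexiskold"))
print(build_query(any_of=["love", "hate"]))
```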
SEARCH PARAMETERS (REST)
Source: https://dev.twitter.com/docs/api/1.1/get/search/tweets
Parameter | Description
q | A UTF-8, URL-encoded search query of 1,000 characters maximum.
geocode | Returns tweets within a given radius of the given coordinates.
lang | Restricts tweets to the given language, given by an ISO 639-1 code.
locale | Specifies the language of the query you are sending (only ja is effective).
result_type | Specifies the type of results: mixed, recent, or popular.
count | The number of tweets to return per page (<= 100).
until | Returns tweets generated before the given date.
since_id | Returns results with an ID greater than the specified ID.
max_id | Returns results with an ID less than or equal to the specified ID.
include_entities | The entities node is omitted when set to false.
callback | The response will use the JSONP format with a callback.
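The count and max_id parameters together allow a crawler to walk backwards through results one page at a time. The sketch below shows that loop against a fake in-memory timeline so it can run without network access. crawl_backwards and fake_fetch are hypothetical names, and a real fetch function would issue a signed GET search/tweets request instead.

```python
def crawl_backwards(fetch_page, query, page_size=100, max_pages=10):
    """Collect tweets page by page, moving max_id down after each page."""
    collected = []
    max_id = None
    for _ in range(max_pages):
        params = {"q": query, "count": page_size}
        if max_id is not None:
            params["max_id"] = max_id
        tweets = fetch_page(params)
        if not tweets:
            break  # nothing older is available
        collected.extend(tweets)
        # max_id is inclusive, so subtract 1 to avoid re-fetching the oldest tweet.
        max_id = min(t["id"] for t in tweets) - 1
    return collected


# Fake data source: 250 tweets, newest first, standing in for the API.
FAKE_TIMELINE = [{"id": i} for i in range(250, 0, -1)]


def fake_fetch(params):
    upper = params.get("max_id", float("inf"))
    matching = [t for t in FAKE_TIMELINE if t["id"] <= upper]
    return matching[: params["count"]]


result = crawl_backwards(fake_fetch, "obama")
print(len(result))
```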
SEARCH PARAMETERS (STREAMING)
Source: https://dev.twitter.com/docs/api/1.1/post/statuses/filter
Parameter | Description
follow | Indicates the users to return statuses for in the stream.
track | Keywords to track.
locations | Specifies a set of bounding boxes to track.
delimited | Specifies whether messages should be length-delimited.
stall_warnings | Specifies whether stall warnings should be delivered.
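As an illustration of how these parameters are encoded, the sketch below builds the form fields for a POST statuses/filter request. It assumes comma-separated lists and bounding boxes written as longitude/latitude pairs with the south-west corner first. build_filter_params is a hypothetical helper, the bounding box around Taiwan is approximate, and boxes that cross the 180° meridian are not handled.

```python
def build_filter_params(track=(), follow=(), boxes=(), stall_warnings=False):
    """Return the form fields for a statuses/filter request (illustrative sketch)."""
    params = {}
    if track:
        params["track"] = ",".join(track)
    if follow:
        params["follow"] = ",".join(str(user_id) for user_id in follow)
    if boxes:
        coords = []
        for west, south, east, north in boxes:
            if not (west < east and south < north):
                raise ValueError("box must be (west, south, east, north)")
            coords.extend([west, south, east, north])
        params["locations"] = ",".join(str(c) for c in coords)
    if not params:
        # The endpoint needs at least one filter predicate.
        raise ValueError("give at least one of track, follow or boxes")
    if stall_warnings:
        params["stall_warnings"] = "true"
    return params


# Approximate box around Taiwan, for illustration only.
fields = build_filter_params(track=["YouTube", "News"],
                             boxes=[(119.3, 21.8, 122.1, 25.4)])
print(fields)
```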
WHAT DOES A TWEET LOOK LIKE?
CRAWLING EFFICIENCY
• Duration: May 6th to June 30th in 2012 (55 days)
• REST API
  – Maximum TPS: 450 × 100 / (15 × 60) = 50 (tweets/sec), i.e. 450 requests per 15 minutes at 100 tweets per request
• Streaming API
  – Randomly returns tweets containing a specific search keyword
  – The total quantity never exceeds 1% of the full public stream
Keyword | Streaming API Total | Streaming API TPS | REST API Total | REST API TPS | Proportion (S/R)*
YouTube | 143,869,821 | 30.28 | 6,306,355 | 1.33 | 22.81
News | 41,482,108 | 8.73 | 7,906,215 | 1.66 | 5.25
Google | 28,720,525 | 6.04 | 7,474,687 | 1.57 | 3.84
Obama | 8,503,834 | 1.79 | 5,271,187 | 1.11 | 1.61
*TPS: Tweets Per Second *S/R: Streaming/REST
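The TPS and S/R columns can be reproduced from the totals and the 55-day crawl window. The short check below assumes TPS means the total divided by the number of seconds in those 55 days, which matches the published figures. Variable names are illustrative.

```python
from datetime import date

# Crawl window from the slide: May 6 to June 30, 2012.
days = (date(2012, 6, 30) - date(2012, 5, 6)).days
seconds = days * 24 * 60 * 60

# REST ceiling: 450 requests per 15 minutes, 100 tweets per request.
rest_max_tps = 450 * 100 / (15 * 60)

totals = {  # keyword: (streaming total, REST total)
    "YouTube": (143_869_821, 6_306_355),
    "News": (41_482_108, 7_906_215),
    "Google": (28_720_525, 7_474_687),
    "Obama": (8_503_834, 5_271_187),
}

table = {}
for keyword, (streaming, rest) in totals.items():
    table[keyword] = (
        round(streaming / seconds, 2),  # streaming TPS
        round(rest / seconds, 2),       # REST TPS
        round(streaming / rest, 2),     # S/R proportion
    )
    print(keyword, table[keyword])
```

Even the busiest keyword stays far below the 50 tweets/sec REST ceiling, which suggests the REST results were limited by what the search returned and not by the rate limit.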
LARGE-SCALE CRAWLING
Track Word | Size | Duration | From | To | #Tweet | 1 Year
YouTube | 12.0 G | 21 days | 2013-07-07 15:12:25 | 2013-07-28 13:10:01 | 52,913,498 | 209 G
News | 5.7 G | 22 days | 2013-07-07 15:07:15 | 2013-07-28 13:10:00 | 21,894,823 | 95 G
Http | 15.0 G | 21 days | 2013-07-07 15:44:13 | 2013-07-28 13:10:00 | 62,976,451 | 261 G
Apple | 1.0 G | 22 days | 2013-07-07 15:07:20 | 2013-07-28 13:10:01 | 4,038,241 | 17 G
Android | 4.1 G | 20 days | 2013-07-07 15:20:43 | 2013-07-28 13:10:00 | 16,605,070 | 75 G
Obama | 682 M | 22 days | 2013-07-07 15:07:05 | 2013-07-28 13:10:01 | 2,768,149 | 11 G
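The "1 Year" column is consistent with a simple linear extrapolation of the measured size over 365 days. That reading is an assumption, but it reproduces every row, as the sketch below shows. yearly_size_gb is a hypothetical name, and 682 M is treated as 0.682 G.

```python
def yearly_size_gb(size_gb, days):
    """Scale a measured size linearly to 365 days and round to whole gigabytes."""
    return round(size_gb * 365 / days)


measured = {  # track word: (size in G, crawl duration in days)
    "YouTube": (12.0, 21),
    "News": (5.7, 22),
    "Http": (15.0, 21),
    "Apple": (1.0, 22),
    "Android": (4.1, 20),
    "Obama": (0.682, 22),
}

projection = {word: yearly_size_gb(size, days) for word, (size, days) in measured.items()}
print(projection)
```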
Twitter + MySQL
SINGLE NODE CRAWLING TYPE
• Guidelines for single-node crawling:
  − Each stream needs to authenticate itself
  − Total data size seems bounded (i.e. the number of tweets delivered to a crawler is limited)
  − Avoid connecting aggressively to the Twitter server
  − Crawling with different Twitter accounts is recommended
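One common way to avoid aggressive reconnects is to wait longer after each failed attempt. The sketch below generates such a schedule. The starting delay, growth factor, and ceiling are illustrative assumptions, not values taken from the slides.

```python
def backoff_delays(initial=5.0, factor=2.0, ceiling=320.0, attempts=8):
    """Return the wait (in seconds) before each reconnect attempt."""
    delays = []
    delay = initial
    for _ in range(attempts):
        delays.append(min(delay, ceiling))  # never wait longer than the ceiling
        delay *= factor                     # grow the wait after every failure
    return delays


schedule = backoff_delays()
print(schedule)
```

A crawler would sleep for the next value in the schedule after each failed connection and reset to the start once a connection succeeds.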
[Diagram: Twitter Server → Tweets Streaming A, B, C, … → one Tweet Crawler]
MULTI-NODE CRAWLING TYPE
• Guidelines for multi-node crawling:
  − Automatically check the connection status
  − Automatically update the databases' summary information
  − Give the crawl program a good log-file reporting function
  − Design a good database schema for distributed access
[Diagram: Twitter Server → Tweets Streaming A and B → separate Tweet Crawlers]
DESIGN TWEET TABLE
Name | Type | Description | Index Type
id | BIGINT UNSIGNED | Unique index ID in database | PRIMARY
tweet_id | BIGINT UNSIGNED | Official Tweet ID | UNIQUE
text | VARCHAR(150) | Tweet content | -
screen_name | VARCHAR(255) | User screen name | -
user_id | BIGINT UNSIGNED | User ID | -
followers_count | INT | Number of followers | -
friends_count | INT | Number of friends | -
created_at | DATETIME | Tweet creation time | -
language | VARCHAR(5) | Language of the Tweet | -
source | VARCHAR(150) | Device or browser used to tweet | -
urls_count | INT | Number of URLs in the Tweet | -
SETTING ENVIRONMENT
• Install packages
  ‒ # apt-get install php5 php5-curl
  ‒ # apt-get install mysql-client mysql-server
  ‒ # apt-get install phpmyadmin
  ‒ Select Apache2 as the web server when installing phpMyAdmin
• Create the database and table for Tweet crawling
  − Create a *.sql file that defines the database format
  − Change directory to the location of that file
  − # mysql -h {$HOST} -u {$USER} -p{$PASSWORD}
  − mysql> \. {$SQL_FILE}
• Check the database with phpMyAdmin
  − Open a browser and go to http://localhost/phpmyadmin
  − Select the database and check its structure
CRAWLING REAL-TIME TWEETS
• Connect to the database
• Save each Tweet into the database
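The slides do this in PHP against MySQL, and that code was not preserved. As a self-contained stand-in, the sketch below uses Python's built-in sqlite3 module with an in-memory database and a table modelled on the tweet table design. It is not the author's implementation: the column types are SQLite approximations, and save_tweet is a hypothetical helper. The point it illustrates is that a UNIQUE index on tweet_id lets the crawler skip duplicates delivered by overlapping streams.

```python
import sqlite3

SCHEMA = """
CREATE TABLE tweets (
    id              INTEGER PRIMARY KEY AUTOINCREMENT,
    tweet_id        INTEGER NOT NULL UNIQUE,
    text            TEXT,
    screen_name     TEXT,
    user_id         INTEGER,
    followers_count INTEGER,
    friends_count   INTEGER,
    created_at      TEXT,
    language        TEXT,
    source          TEXT,
    urls_count      INTEGER
)
"""

INSERT = """
INSERT OR IGNORE INTO tweets
    (tweet_id, text, screen_name, user_id, followers_count,
     friends_count, created_at, language, source, urls_count)
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
"""


def save_tweet(conn, tweet):
    """Insert one tweet; return False if its tweet_id is already stored."""
    user = tweet["user"]
    urls = tweet.get("entities", {}).get("urls", [])
    row = (
        tweet["id"], tweet["text"], user["screen_name"], user["id"],
        user["followers_count"], user["friends_count"],
        tweet["created_at"], tweet.get("lang"), tweet.get("source"), len(urls),
    )
    cursor = conn.execute(INSERT, row)
    conn.commit()
    return cursor.rowcount == 1


conn = sqlite3.connect(":memory:")  # connect to the (stand-in) database
conn.execute(SCHEMA)

example = {  # made-up tweet with the fields the table needs
    "id": 123456789,
    "text": "Watching http://example.com/video",
    "created_at": "Sun Jul 07 15:12:25 +0000 2013",
    "lang": "en",
    "source": "web",
    "user": {"id": 42, "screen_name": "demo_user",
             "followers_count": 10, "friends_count": 20},
    "entities": {"urls": [{"expanded_url": "http://example.com/video"}]},
}
first = save_tweet(conn, example)
second = save_tweet(conn, example)  # duplicate delivery is ignored
stored = conn.execute("SELECT COUNT(*), MAX(urls_count) FROM tweets").fetchone()
print(first, second, stored)
```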
• Copy all files in twitter_watch to /var/www/twitter_watch
  ‒ # cp twitter_watch/server.php /var/www/twitter_watch
  ‒ # cp twitter_watch/logic.hjs /var/www/twitter_watch
  ‒ # cp twitter_watch/index.html /var/www/twitter_watch
• Start crawling tweets
  ‒ $ php5 watch.php
• Click “Browse” to show the crawled Tweets in the database
• Real-time Tweet updates with jQuery
  ‒ Browse http://localhost/twitter_watch/index.html
TROUBLESHOOTING
• Access denied for user 'root'@'localhost' (using password: NO)
  ‒ # /etc/init.d/mysql stop
  ‒ # mysqld_safe --skip-grant-tables &
  ‒ # mysql -u root mysql
  ‒ mysql> UPDATE user SET Password=PASSWORD('xxx') WHERE User='root';
  ‒ mysql> FLUSH PRIVILEGES;
  ‒ mysql> quit;
  ‒ # /etc/init.d/mysql restart
• Be aware of time synchronization
  ‒ # apt-get install ntp
  ‒ # ntpdate -s time.stdtime.gov.tw
  ‒ # hwclock --systohc
URL @ Tweet
SURLMINE
Incremental Mining of Significant URLs in Real-Time and Large-Scale Social Streams
PAKDD 2013
WHY URL?
• A high percentage of Tweets have URLs embedded in them
  − Content length limitation and information completeness
• A URL is a universal language without linguistic differences
• URLs are able to connect different social media platforms
• Tweets with URLs have been shown to have a low probability of being spam
Social Media | Character Limit | Nature
Twitter | 140 characters | Short message
Plurk | 140 characters | Short message
LinkedIn | 200 ~ 689 characters | Job opportunities
Google+ | 100,000 characters | Mix information
Facebook | 63,206 characters | Mix information
YouTube | 1,000 characters | Video sharing
CHALLENGE
• URL shorteners make URLs hard to analyze
• Usage differs across the various URL shortening services
• Shortened URLs are time-limited and can expire at any time
• A general solution is needed to expand shortened URLs to the original URLs
• Some URLs link to phishing websites
Keyword | original | bit.ly | tinyurl | ow.ly | goo.gl | others | URL %
YouTube | 96.49% | 0.95% | 0.14% | 0.10% | 0.12% | 2.20% | 90.80%
News | 37.92% | 17.92% | 1.10% | 0.00% | 2.17% | 40.89% | 75.77%
Google | 54.49% | 16.30% | 0.98% | 2.28% | 4.12% | 21.83% | 60.67%
Obama | 30.20% | 23.33% | 2.27% | 2.62% | 2.87% | 38.71% | 54.22%
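A breakdown like this can be produced by looking at the host name of each URL. The sketch below shows one possible classifier. The slides do not say which services fall under "others", so the OTHER_SHORTENERS set is an illustrative guess, and classify is a hypothetical helper.

```python
from collections import Counter
from urllib.parse import urlparse

# Services named in the table, mapped to their column labels.
NAMED_SHORTENERS = {
    "bit.ly": "bit.ly",
    "tinyurl.com": "tinyurl",
    "ow.ly": "ow.ly",
    "goo.gl": "goo.gl",
}
# Assumption: a few further shorteners counted as "others".
OTHER_SHORTENERS = {"t.co", "is.gd"}


def classify(url):
    """Return the table column a URL belongs to, based on its host name."""
    host = (urlparse(url).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    if host in NAMED_SHORTENERS:
        return NAMED_SHORTENERS[host]
    if host in OTHER_SHORTENERS:
        return "others"
    return "original"


urls = [
    "http://bit.ly/abc",
    "https://www.youtube.com/watch?v=x",
    "http://t.co/xyz",
    "http://goo.gl/q",
]
counts = Counter(classify(u) for u in urls)
shares = {label: 100 * n / len(urls) for label, n in counts.items()}
print(shares)
```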
EXPAND URL SHORTENERS
• Recursively track web page redirections
  − Beware of being identified as a DNS attack (use a cache table)
  − The redirection link may change with different browsers
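The idea of following redirections while consulting a cache table can be sketched as follows. The resolve argument is a placeholder for whatever performs one redirect lookup (for example an HTTP request that does not follow redirects); here it is replaced by a dictionary so the example runs offline. expand_url, the hop limit, and the sample URLs are all illustrative assumptions.

```python
def expand_url(url, resolve, cache, max_hops=10):
    """Follow redirects from url until they stop, reusing cached results."""
    chain = []          # URLs visited during this expansion
    current = url
    for _ in range(max_hops):
        if current in cache:
            current = cache[current]  # final target already known
            break
        chain.append(current)
        target = resolve(current)     # next hop, or None if no redirect
        if target is None or target in chain:
            break                     # reached the end, or found a loop
        current = target
    for seen in chain:
        cache[seen] = current         # remember the answer for every hop
    return current


# Offline stand-in for real redirect lookups.
REDIRECTS = {
    "http://bit.ly/a": "http://tinyurl.com/b",
    "http://tinyurl.com/b": "http://example.com/page",
}
lookups = []


def fake_resolve(url):
    lookups.append(url)
    return REDIRECTS.get(url)


cache = {}
first = expand_url("http://bit.ly/a", fake_resolve, cache)
second = expand_url("http://tinyurl.com/b", fake_resolve, cache)  # served from cache
print(first, second, len(lookups))
```

Caching every hop means a popular short link is resolved once, which keeps the number of outgoing lookups low.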
URL STATS @ TWEET
Track Word | #Tweet | #URL | URL % | URL Per Second
YouTube | 52,982,166 | 49,975,035 | 94.32 % | 27.62
News | 21,948,837 | 15,572,228 | 70.95 % | 8.60
Http | 62,976,451 | 42,249,898 | 67.09 % | 23.41
Apple | 4,045,333 | 2,670,731 | 66.02 % | 1.48
Android | 16,605,070 | 15,242,497 | 91.79 % | 8.44
Obama | 2,771,791 | 950,780 | 34.30 % | 0.53
TRACK “TAIWAN” ON TWITTER
We demand the truth and justice!
Thank You
Q & A