Page 1: BIG Data: Crawling Large-Scale and Real-Time Tweets With MySQL Database 2013 Open Seminar Series 6 Open Geospatial Informatics Cheng-Ying Liu (Sean) bermuda@citi.sinica.edu.tw.

BIG Data: Crawling Large-Scale and Real-Time Tweets With MySQL Database

2013 Open Seminar Series 6 Open Geospatial Informatics

Cheng-Ying Liu (Sean)

http://bermuda.citi.sinica.edu.tw

BIG Data & Twitter

WHAT IS BIG DATA ?

In information technology, big data is a loosely-defined term used to describe data sets so large and complex that they become awkward to work with using on-hand database management tools.

《Wikipedia Big data》

Source: http://en.wikipedia.org/wiki/Big_data

WHAT IS BIG DATA ?

• In 2001, Doug Laney used the 3V model to describe Big Data
‒ Volume: amount of data
‒ Velocity: speed of data in and out
‒ Variety: range of data types and sources
• A fourth V was added to the model by later authors
‒ Veracity: truthfulness or accuracy of data

WHAT IS BIG DATA ?

• In 2012, Gartner updated the definition
– Still advocates the 3V model for describing data
– Requires new forms of processing
– Enables enhanced decision making
– Enables insight discovery
– Enables process optimization

HOW BIG IS BIG DATA ?

• Beyond the ability of commonly used tools
• A few dozen terabytes (1 TB = 10^12 bytes) to many petabytes (1 PB = 10^15 bytes)

− 2008: Google processes 20 PB a day
− 2009: Facebook has 2.5 PB user data + 15 TB/day
− 2009: eBay has 6.5 PB user data + 50 TB/day
− 2011: Yahoo! has 180-200 PB of data
− 2012: Facebook ingests 500 TB/day

NEW TECHNOLOGY FOR BIG DATA

• Hadoop
– Developed by the Apache Software Foundation
– Derived from Google's MapReduce and Google File System
– Able to process petabyte-scale data sets

• NoSQL (Not Only SQL)
– Relational databases do not suit every case
– NoSQL offers a non-relational alternative
– Adopted by Google, Facebook, Twitter, etc.

WHAT IS TWITTER?
• The fastest, simplest way to communicate
• More than 140M active users
• The majority of traffic comes from mobile
• 60% of users are outside the U.S.
• More than 400M twitter.com visitors
• More than 400M tweets/day (peak: 25K/sec)
• 1,000 employees (majority in San Francisco)
• 50% of employees are engineers
• Expected by eMarketer to reach nearly $1 billion in global ad revenue in 2014

TWITTER HISTORY

• Evan Williams on the genesis of Twitter, ICWSM, April 2007:
− A side project started from Jack Dorsey’s idea, Oct 2006
− Wanted a ubiquitous status message
− A community of people answering the question “what are you doing?”
− Exploded at SXSW, SF earthquakes (2011)
− Good for collective “backchanneling”
− High “ambient intimacy”
− Huge API usage was unexpected, as was the rise of the @ sign for replies

HOW BIG IS TWITTER ?

Source: http://blog.twitter.com/2011/06/200-million-tweets-per-day.html

IT’S NOT JUST BIG! IT’S FRESH!

Source: http://xkcd.com/723/

WHAT IS A TWEET?

TWITTER TOWN HALL

July 6, 2011

TWITTER STATS

Mapping the global Twitter heartbeat: The geography of Twitter, May 2013

Source: http://www.sgi.com/go/twitter/images/hires/figure4.png

TWITTER STATS

TWITTER STATS

Source: Pew Research Center's Internet & American Life Project Winter 2012 Tracking Survey, January 20-February 19, 2012. N=2,253 adults age 18 and older, including 901 cell phone interviews. Interviews conducted in English and Spanish. The margin of error is +/-2.7 percentage points for internet users. **Represents significant difference compared with all other rows in group.

TWITTER STATS

TWITTER STATS

Twitter Dev

TWITTER ACCOUNT
• Register a Twitter account (required)

REGISTER A TWITTER APPLICATION
• Twitter developer web site: https://dev.twitter.com/
• Select “My applications”

REGISTER A TWITTER APPLICATION
• Click “Create a new application”

Application List

REGISTER A TWITTER APPLICATION
• Fill in the required information

REGISTER A TWITTER APPLICATION
• Agree to the developer rules and complete the CAPTCHA

REGISTER A TWITTER APPLICATION
• Go back to the application list and click your application
• Click “Settings”

REGISTER A TWITTER APPLICATION
• Select “Read, Write and Access direct messages”
• Click “Update this Twitter application’s settings”

REGISTER A TWITTER APPLICATION
• Click “Create my access token”

REGISTER A TWITTER APPLICATION

Twitter API Resource

REST API

Source: https://dev.twitter.com/docs/streaming-apis

STREAMING API

Source: https://dev.twitter.com/docs/streaming-apis

TWEET CRAWL API

Resource | Description | Request Limit (Per User) | Request Limit (Via OAuth)
GET statuses/show/:id | Returns a single Tweet, specified by the id parameter. | 180 / 15 mins | 180 / 15 mins
POST statuses/update | Updates the authenticating user's current status, also known as tweeting. | - | -
GET search/tweets | Returns a collection of relevant Tweets matching a specified query. | 180 / 15 mins | 450 / 15 mins
POST statuses/filter | Returns public statuses that match one or more filter predicates. | - | -
GET statuses/firehose | Returns all public statuses. This endpoint requires special permission to access. | - | -

Source: https://dev.twitter.com/docs/api/1.1

Source: https://dev.twitter.com/docs/rate-limiting/1.1/limits

tmhOAuth LIBRARY

• Website: https://github.com/themattharris/tmhOAuth
• $ git clone https://github.com/themattharris/tmhOAuth.git
• Current Version 0.8.2
• Author: Matt Harris @themattharris
• Goal:
‒ Support OAuth 1.0A
‒ Use authorization headers instead of query string or POST parameters
‒ Allow uploading of images
‒ Provide enough information to assist with debugging

CRAWLING WITH REST API
• Create an OAuth object that holds the authentication tokens

• Set parameters for API

• Use Twitter REST API to obtain tweets

CRAWLING WITH STREAMING API
• Create an OAuth object that holds the authentication tokens

• Set parameters for API

• Construct a connection to Twitter server

WHAT IS OAuth?
• OAuth = Open Authorization
• What is OAuth:
‒ An open protocol to allow secure API authorization in a simple and standard method from desktop and web applications.

• Endpoints used in the OAuth flow:
‒ Request token URL
‒ Authorize URL
‒ Access token URL
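The signing step that an OAuth 1.0a library such as tmhOAuth performs for each request can be sketched as follows. This is a minimal illustration of the HMAC-SHA1 signature described in the OAuth 1.0a specification, not the library's own code; the credentials, nonce, and timestamp are placeholders, and the function names are hypothetical.

```python
import base64
import hashlib
import hmac
from urllib.parse import quote


def percent_encode(value: str) -> str:
    # RFC 3986 encoding: only unreserved characters stay unescaped.
    return quote(value, safe="-._~")


def signature_base_string(method: str, url: str, params: dict) -> str:
    # Encode each key and value, sort the pairs, join them,
    # then encode the joined string once more.
    pairs = sorted((percent_encode(k), percent_encode(v)) for k, v in params.items())
    param_string = "&".join(f"{k}={v}" for k, v in pairs)
    return "&".join([method.upper(), percent_encode(url), percent_encode(param_string)])


def sign(method: str, url: str, params: dict, consumer_secret: str, token_secret: str) -> str:
    # The signing key is the two secrets joined by "&".
    key = f"{percent_encode(consumer_secret)}&{percent_encode(token_secret)}"
    base = signature_base_string(method, url, params)
    digest = hmac.new(key.encode(), base.encode(), hashlib.sha1).digest()
    return base64.b64encode(digest).decode()


# Placeholder credentials and a fixed nonce/timestamp, for illustration only.
params = {
    "q": "happy hour",
    "count": "100",
    "oauth_consumer_key": "CONSUMER_KEY",
    "oauth_nonce": "abc123",
    "oauth_signature_method": "HMAC-SHA1",
    "oauth_timestamp": "1375000000",
    "oauth_token": "ACCESS_TOKEN",
    "oauth_version": "1.0",
}
url = "https://api.twitter.com/1.1/search/tweets.json"
base_string = signature_base_string("GET", url, params)
oauth_signature = sign("GET", url, params, "CONSUMER_SECRET", "TOKEN_SECRET")
```

The resulting `oauth_signature` would be sent with the other `oauth_*` values in the `Authorization` header. Because the timestamp is part of the signed data, a crawler whose clock drifts is likely to see its requests rejected.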

NORMAL SEARCH OPERATORS

Operator | Finds tweets...
twitter search | containing both "twitter" and "search". This is the default operator.
"happy hour" | containing the exact phrase "happy hour".
love OR hate | containing either "love" or "hate" (or both).
beer -root | containing "beer" but not "root".
#haiku | containing the hashtag "haiku".
from:alexiskold | sent from person "alexiskold".
to:techcrunch | sent to person "techcrunch".
@mashable | referencing person "mashable".
"happy hour" near:"san francisco" | containing the exact phrase "happy hour" and sent near "san francisco".
near:NYC within:15mi | sent within 15 miles of "NYC".

SEARCH PARAMETERS (REST)

Source: https://dev.twitter.com/docs/api/1.1/get/search/tweets

Parameter | Description
q | A UTF-8, URL-encoded search query of 1,000 characters maximum.
geocode | Returns tweets within a given radius of the given coordinates.
lang | Restricts tweets to the given language, given by an ISO 639-1 code.
locale | Specifies the language of the query you are sending (only ja is effective).
result_type | Specifies the type of results to return: mixed, recent, or popular.
count | The number of tweets to return per page (<=100).
until | Returns tweets generated before the given date.
since_id | Returns results with an ID greater than the specified ID.
max_id | Returns results with an ID less than or equal to the specified ID.
include_entities | The entities node is omitted when set to false.
callback | The response will use the JSONP format with a callback.
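As a sketch of how these parameters combine when paging backwards through search results, the following builds request URLs for `GET search/tweets` and derives the next `max_id` from a page of results. The helper names and the tweet IDs are made up for illustration, and the request itself (which would also need OAuth headers) is not sent.

```python
from urllib.parse import urlencode

SEARCH_URL = "https://api.twitter.com/1.1/search/tweets.json"


def build_search_url(query, max_id=None, count=100, lang=None, geocode=None):
    # count is capped at 100, the per-page maximum listed in the table.
    params = {"q": query, "count": min(count, 100), "result_type": "recent"}
    if lang:
        params["lang"] = lang
    if geocode:
        params["geocode"] = geocode
    if max_id is not None:
        params["max_id"] = max_id
    return SEARCH_URL + "?" + urlencode(params)


def next_max_id(statuses):
    # max_id is inclusive, so request IDs strictly below the oldest tweet seen.
    return min(s["id"] for s in statuses) - 1


first_page = build_search_url('"happy hour" -root', lang="en")
page = [{"id": 1050}, {"id": 1042}, {"id": 1037}]  # made-up tweet IDs
second_page = build_search_url('"happy hour" -root', max_id=next_max_id(page), lang="en")
```

Repeating this until a page comes back empty walks the result set from newest to oldest without fetching the same tweet twice.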

SEARCH PARAMETERS (STREAMING)

Source: https://dev.twitter.com/docs/api/1.1/post/statuses/filter

Parameter | Description
follow | The users whose statuses should be returned in the stream.
track | Keywords to track.
locations | Specifies a set of bounding boxes to track.
delimited | Specifies whether messages should be length-delimited.
stall_warnings | Specifies whether stall warnings should be delivered.
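To illustrate what `delimited` changes for the client, here is a minimal sketch of reading a length-delimited stream. It assumes each message is preceded by its byte length as a decimal number on its own line, and that blank lines may appear as keep-alives; that framing is an assumption based on the Streaming API documentation, and the stream below is simulated rather than read from Twitter.

```python
import io
import json


def read_delimited(stream):
    """Yield decoded JSON messages from a length-delimited byte stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        line = line.strip()
        if not line:
            continue  # blank keep-alive line
        length = int(line)
        payload = stream.read(length)
        yield json.loads(payload)


def frame(message):
    # Build one simulated message: "<length>\r\n<json>\r\n".
    body = json.dumps(message).encode("utf-8") + b"\r\n"
    return str(len(body)).encode("ascii") + b"\r\n" + body


fake_stream = io.BytesIO(
    b"\r\n" + frame({"id": 1, "text": "hello"}) + b"\r\n" + frame({"id": 2, "text": "world"})
)
messages = list(read_delimited(fake_stream))
```

Reading a known number of bytes avoids scanning for newlines inside the payload, which matters when the crawler has to keep up with a busy keyword.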

WHAT DOES A TWEET LOOK LIKE?

CRAWLING EFFICIENCY

• Duration: May 6th to June 30th in 2012 (55 days)
• REST API
– Maximum TPS: 450 requests × 100 tweets ÷ (15 × 60 sec) = 50 (Tweets / sec)
• Streaming API
– Randomly returns tweets containing a specific search keyword
– The total quantity never exceeds 1% of all public data streams

Keyword | Streaming API Total | Streaming API TPS | REST API Total | REST API TPS | Proportion (S/R)*
YouTube | 143,869,821 | 30.28 | 6,306,355 | 1.33 | 22.81
News | 41,482,108 | 8.73 | 7,906,215 | 1.66 | 5.25
Google | 28,720,525 | 6.04 | 7,474,687 | 1.57 | 3.84
Obama | 8,503,834 | 1.79 | 5,271,187 | 1.11 | 1.61

*TPS: Tweets Per Second
*S/R: Streaming/REST
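The figures on this slide can be reproduced with a short calculation. The sketch below assumes the TPS columns are the totals divided by the 55-day crawl duration, which matches the reported values for the rows checked here.

```python
SECONDS_PER_DAY = 24 * 60 * 60
crawl_seconds = 55 * SECONDS_PER_DAY  # May 6th to June 30th, as stated on the slide

# REST ceiling: 450 requests per 15-minute window, up to 100 tweets per request.
rest_max_tps = 450 * 100 / (15 * 60)

# (Streaming total, REST total) per keyword, copied from the table.
totals = {
    "YouTube": (143_869_821, 6_306_355),
    "News": (41_482_108, 7_906_215),
    "Google": (28_720_525, 7_474_687),
    "Obama": (8_503_834, 5_271_187),
}

# (Streaming TPS, REST TPS, Streaming/REST proportion), rounded as in the table.
rows = {
    keyword: (
        round(streaming / crawl_seconds, 2),
        round(rest / crawl_seconds, 2),
        round(streaming / rest, 2),
    )
    for keyword, (streaming, rest) in totals.items()
}
```

Even the busiest keyword stays below the 50 tweets/sec REST ceiling on the streaming side, while the measured REST rates are far lower than that ceiling.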

LARGE-SCALE CRAWLING

Track Word | Size | Duration | From | To | #Tweet | 1 Year
YouTube | 12.0 G | 21 days | 2013-07-07 15:12:25 | 2013-07-28 13:10:01 | 52,913,498 | 209 G
News | 5.7 G | 22 days | 2013-07-07 15:07:15 | 2013-07-28 13:10:00 | 21,894,823 | 95 G
Http | 15.0 G | 21 days | 2013-07-07 15:44:13 | 2013-07-28 13:10:00 | 62,976,451 | 261 G
Apple | 1.0 G | 22 days | 2013-07-07 15:07:20 | 2013-07-28 13:10:01 | 4,038,241 | 17 G
Android | 4.1 G | 20 days | 2013-07-07 15:20:43 | 2013-07-28 13:10:00 | 16,605,070 | 75 G
Obama | 682 M | 22 days | 2013-07-07 15:07:05 | 2013-07-28 13:10:01 | 2,768,149 | 11 G
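The "1 Year" column appears to be a linear projection of the measured size to 365 days. A short sketch under that assumption, treating 682 M as 0.682 G:

```python
# (size in G, crawl duration in days) per track word, copied from the table.
crawls = {
    "YouTube": (12.0, 21),
    "News": (5.7, 22),
    "Http": (15.0, 21),
    "Apple": (1.0, 22),
    "Android": (4.1, 20),
    "Obama": (0.682, 22),
}

# Scale each measured size to a full year and round to whole gigabytes.
projected = {word: round(size / days * 365) for word, (size, days) in crawls.items()}
```

This kind of projection is useful for sizing the MySQL storage before a long-running crawl starts.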

Twitter + MySQL

SINGLE NODE CRAWLING TYPE

• Guidelines for single-node crawling:
− Each stream must authenticate itself
− The total data volume appears to be bounded (i.e., the number of tweets delivered to one crawler is limited)
− Avoid connecting to the Twitter server aggressively
− Crawling with different Twitter accounts is recommended

[Figure: Twitter Server, Tweets Streaming A, B, and C, one Tweet Crawler]
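One way to avoid connecting aggressively is to wait longer after each failed reconnect. The sketch below generates such a delay schedule; the specific numbers (0.25 s linear steps capped at 16 s for network errors, 5 s doubling up to 320 s for HTTP errors) are assumptions recalled from Twitter's streaming guidance of the time, not values taken from the slides.

```python
import itertools


def backoff_delays(initial, cap, exponential=True):
    """Yield reconnect delays in seconds, growing until they reach the cap."""
    delay = initial
    while True:
        yield delay
        step = delay * 2 if exponential else delay + initial
        delay = min(step, cap)


# Linear backoff for network-level errors (assumed values).
network_error = list(itertools.islice(backoff_delays(0.25, 16, exponential=False), 5))
# Exponential backoff for HTTP-level errors (assumed values).
http_error = list(itertools.islice(backoff_delays(5, 320), 8))
```

A crawler would sleep for the next delay before each reconnect attempt and restart the schedule once a connection succeeds.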

MULTI-NODE CRAWLING TYPE

• Guidelines for multi-node crawling:
− Automatically check the connection status
− Automatically update the database summary information
− Design the crawl program with a good log-file reporting function
− Design a good database schema for distributed access

[Figure: Twitter Server, Tweets Streaming A and B, two Tweet Crawlers]

DESIGN TWEET TABLE

Name | Type | Description | Index Type
id | BIGINT UNSIGNED | Unique index ID in database | PRIMARY
tweet_id | BIGINT UNSIGNED | Official Tweet ID | UNIQUE
text | VARCHAR( 150 ) | Tweet content | -
screen_name | VARCHAR( 255 ) | User screen name | -
user_id | BIGINT UNSIGNED | User ID | -
followers_count | INT | Number of followers | -
friends_count | INT | Number of friends | -
created_at | DATETIME | Tweet creation time | -
language | VARCHAR( 5 ) | Language of the Tweet | -
source | VARCHAR( 150 ) | Device or browser used to Tweet | -
urls_count | INT | Number of URLs in the Tweet | -
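The table design can be tried out locally with a short script. The sketch below uses Python's built-in `sqlite3` module as a stand-in for MySQL, so the `id` column is declared the SQLite way, and the inserted row is made up. It shows why the UNIQUE index on `tweet_id` matters: a tweet delivered twice is stored once.

```python
import sqlite3

SCHEMA = """
CREATE TABLE tweets (
    id              INTEGER PRIMARY KEY,  -- BIGINT UNSIGNED primary key in MySQL
    tweet_id        BIGINT UNSIGNED NOT NULL UNIQUE,
    text            VARCHAR(150),
    screen_name     VARCHAR(255),
    user_id         BIGINT UNSIGNED,
    followers_count INT,
    friends_count   INT,
    created_at      DATETIME,
    language        VARCHAR(5),
    source          VARCHAR(150),
    urls_count      INT
)
"""

conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)

# SQLite spelling; the MySQL counterpart is INSERT IGNORE.
insert = (
    "INSERT OR IGNORE INTO tweets (tweet_id, text, screen_name, user_id) "
    "VALUES (?, ?, ?, ?)"
)
row = (361000000000000001, "hello", "example_user", 42)  # made-up tweet
conn.execute(insert, row)
conn.execute(insert, row)  # duplicate delivery of the same tweet is ignored
stored = conn.execute("SELECT COUNT(*) FROM tweets").fetchone()[0]
```

With several crawlers writing to the same table, letting the database reject duplicates is simpler than coordinating the crawlers themselves.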

SETTING ENVIRONMENT

• Install packages
‒ # apt-get install php5 php5-curl
‒ # apt-get install mysql-client mysql-server
‒ # apt-get install phpmyadmin
‒ Set Apache2 as the web server when installing phpmyadmin

SETTING ENVIRONMENT
• Create the database and table for Tweet crawling
− Create a *.sql file for the database format
− Change directory to that file
− # mysql -h {$HOST} -u {$USER} -p{$PASSWORD}
− mysql> \. {$SQL_FILE}

SETTING ENVIRONMENT
• Check the database with phpmyadmin
− Open a browser and go to http://localhost/phpmyadmin
− Select the database and check its structure

CRAWLING REAL-TIME TWEETS
• Connect to the database

• Save each Tweet into the database
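Saving a Tweet means mapping the JSON object delivered by the API onto the columns of the tweet table. The sketch below shows one possible mapping; the field names follow the v1.1 Tweet object as I understand it, the function name is hypothetical, and the sample tweet is made up.

```python
from datetime import datetime


def tweet_to_row(tweet):
    """Map a decoded Tweet JSON object onto the columns of the tweet table."""
    user = tweet["user"]
    # created_at arrives like "Sun Jul 07 15:12:25 +0000 2013".
    created = datetime.strptime(tweet["created_at"], "%a %b %d %H:%M:%S %z %Y")
    return {
        "tweet_id": tweet["id"],
        "text": tweet["text"],
        "screen_name": user["screen_name"],
        "user_id": user["id"],
        "followers_count": user["followers_count"],
        "friends_count": user["friends_count"],
        "created_at": created.strftime("%Y-%m-%d %H:%M:%S"),  # DATETIME format
        "language": tweet.get("lang"),
        "source": tweet.get("source"),
        "urls_count": len(tweet.get("entities", {}).get("urls", [])),
    }


sample = {
    "id": 361000000000000001,
    "text": "Watching http://example.com/video",
    "created_at": "Sun Jul 07 15:12:25 +0000 2013",
    "lang": "en",
    "source": "web",
    "user": {"id": 42, "screen_name": "example_user",
             "followers_count": 10, "friends_count": 20},
    "entities": {"urls": [{"expanded_url": "http://example.com/video"}]},
}
row = tweet_to_row(sample)
```

The resulting dictionary can be passed to a parameterized INSERT statement, which also avoids quoting problems with tweet text.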

CRAWLING REAL-TIME TWEETS

• Copy all files in twitter_watch to /var/www/twitter_watch
‒ # cp twitter_watch/server.php /var/www/twitter_watch
‒ # cp twitter_watch/logic.hjs /var/www/twitter_watch
‒ # cp twitter_watch/index.html /var/www/twitter_watch

• Start crawling tweets
‒ $ php5 watch.php

CRAWLING REAL-TIME TWEETS
• Click “Browse” to show the crawled Tweets in the database

CRAWLING REAL-TIME TWEETS
• Update Tweets in real time with jQuery
‒ Browse http://localhost/twitter_watch/index.html

TROUBLESHOOTING
• Access denied for user 'root'@'localhost' (using password: NO)
‒ # /etc/init.d/mysql stop
‒ # mysqld_safe --skip-grant-tables &
‒ # mysql -u root mysql
‒ mysql> UPDATE user SET Password=PASSWORD('xxx') WHERE User='root';
‒ mysql> FLUSH PRIVILEGES;
‒ mysql> quit;
‒ # /etc/init.d/mysql restart

• Be aware of time synchronization
‒ # apt-get install ntp
‒ # ntpdate -s time.stdtime.gov.tw
‒ # hwclock --systohc

URL @ Tweet

SURLMINE

Incremental Mining of Significant URLs in Real-Time and Large-Scale Social Streams

PAKDD 2013

WHY URL?
• A high percentage of Tweets have URLs embedded in them
− Because of the content length limitation and the need for information completeness
• A URL is a universal language without linguistic differences
• A URL is able to connect different social media platforms
• Tweets with URLs have been verified to have a low probability of being spam

Social Media | Character Limit | Nature
Twitter | 140 characters | Short message
Plurk | 140 characters | Short message
LinkedIn | 200 ~ 689 characters | Job opportunities
Google+ | 100,000 characters | Mixed information
Facebook | 63,206 characters | Mixed information
YouTube | 1,000 characters | Video sharing

CHALLENGE
• URL shorteners make URLs hard to analyze
• The usage of the various URL shortening services differs
• Shortened URLs are time-limited and may expire at any time
• A general solution is needed to expand shortened URLs to the original URLs
• Some URLs link to phishing websites

Keyword | original | bit.ly | tinyurl | ow.ly | goo.gl | others | URL %
YouTube | 96.49% | 0.95% | 0.14% | 0.10% | 0.12% | 2.20% | 90.80%
News | 37.92% | 17.92% | 1.10% | 0.00% | 2.17% | 40.89% | 75.77%
Google | 54.49% | 16.30% | 0.98% | 2.28% | 4.12% | 21.83% | 60.67%
Obama | 30.20% | 23.33% | 2.27% | 2.62% | 2.87% | 38.71% | 54.22%

EXPAND URL SHORTENERS

• Recursively track web page redirections
− Avoid being identified as a DNS attack (use a cache table)
− Redirection links may change with different browsers
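Redirect tracking with a cache table can be sketched as follows. This is an illustration under stated assumptions rather than the method used in the talk: `fetch_location` is a hypothetical callable that returns the redirect target of a URL (for example by issuing an HTTP HEAD request and reading the `Location` header) or `None` when there is no redirect, and it is replaced here by a fake lookup so the example runs offline.

```python
from typing import Callable, Dict, Optional
from urllib.parse import urljoin


def expand_url(url: str,
               fetch_location: Callable[[str], Optional[str]],
               cache: Dict[str, str],
               max_hops: int = 10) -> str:
    """Follow redirects hop by hop until a URL no longer redirects."""
    chain = []
    current = url
    for _ in range(max_hops):
        if current in cache:
            current = cache[current]  # already resolved earlier
            break
        chain.append(current)
        location = fetch_location(current)
        if location is None:
            break  # no further redirect
        current = urljoin(current, location)  # the target may be relative
    # Remember the result for every URL visited, so repeats cost no requests.
    for seen in chain:
        cache[seen] = current
    return current


# Fake redirect table standing in for real HTTP requests.
redirects = {
    "http://bit.ly/abc": "http://t.co/xyz",
    "http://t.co/xyz": "http://example.com/article",
}
calls = []


def fake_fetch(u):
    calls.append(u)
    return redirects.get(u)


cache = {}
final = expand_url("http://bit.ly/abc", fake_fetch, cache)
again = expand_url("http://bit.ly/abc", fake_fetch, cache)  # served from cache
```

The `max_hops` limit guards against redirect loops, and the cache keeps a popular short URL from triggering a new lookup every time it appears in the stream.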

URL STATS @ TWEET

Track Word | #Tweet | #URL | URL % | URL Per Second
YouTube | 52,982,166 | 49,975,035 | 94.32 % | 27.62
News | 21,948,837 | 15,572,228 | 70.95 % | 8.60
Http | 62,976,451 | 42,249,898 | 67.09 % | 23.41
Apple | 4,045,333 | 2,670,731 | 66.02 % | 1.48
Android | 16,605,070 | 15,242,497 | 91.79 % | 8.44
Obama | 2,771,791 | 950,780 | 34.30 % | 0.53

TRACK “TAIWAN” ON TWITTER

We demand the truth and justice!

Thank You

Q & A