Twitter6
Uploaded by dae-myung-kang
Twitter: The Tweet, the Whole Tweet, and Nothing but the Tweet #2
Data Mining
Discover New Knowledge from Existing Information
What Do #TeaParty and #JustinBieber Have in Common?
Tools: Pymongo, MongoDB
    apt-get install python-dev
    pip install pymongo
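    # Fetch up to 100 tweets tagged #JustinBieber and store each one in MongoDB.
    # Uses the legacy pymongo Connection class and tweepy's unauthenticated search.
    from pymongo.connection import Connection
    import tweepy

    connection = Connection("localhost")
    db = connection.foo

    api = tweepy.API()
    tweets = api.search('#JustinBieber', rpp=100)
    for tweet in tweets:
        # __getstate__() returns the tweet's fields as a dict
        db.foo.save(tweet.__getstate__())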
Get Tweets
    # Fetch 15 pages of 100 results each and store every tweet in MongoDB.
    # Uses the legacy pymongo Connection class and tweepy's unauthenticated search.
    from pymongo.connection import Connection
    import tweepy

    connection = Connection("localhost")
    db = connection.foo

    api = tweepy.API()
    for num in range(1, 16):
        tweets = api.search('#JustinBieber', rpp=100, page=num)
        for tweet in tweets:
            db.foo.save(tweet.__getstate__())
Insert to MongoDB
    map = function () {
        // Emit a count of 1 for each space-separated word in the tweet text.
        var words = this.text.split(' ');
        for (var i in words) {
            emit({ key: words[i] }, { count: 1 });
        }
    };
Count Frequency in mongo
MAP
    reduce = function (key, values) {
        // Sum the partial counts emitted for this key.
        var count = 0;
        values.forEach(function (v) { count += v.count; });
        return { count: count };
    };
Count Frequency in mongo
REDUCE
res = db.foo.mapReduce( map, reduce, {out: "mystring"});
Count Frequency in mongo
EXECUTE
    { "_id" : { "key" : "#1000ADay" }, "value" : { "count" : 1 } }
    { "_id" : { "key" : "#1000aday" }, "value" : { "count" : 1 } }
    { "_id" : { "key" : "#500ADay" }, "value" : { "count" : 1 } }
    { "_id" : { "key" : "#500aday" }, "value" : { "count" : 1 } }
    { "_id" : { "key" : "#AutoFollow" }, "value" : { "count" : 1 } }
    { "_id" : { "key" : "#Bieber" }, "value" : { "count" : 1 } }
Count Frequency in mongo
RESULT
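The map and reduce steps can be mirrored in plain Python to sanity-check the counts on a small sample. This is only a sketch of the same idea, not how MongoDB runs it: the function names `map_words`, `reduce_counts`, and `word_frequencies`, and the sample tweets, are made up for illustration.

```python
from collections import defaultdict


def map_words(text):
    """Emit a (word, 1) pair for each space-separated token, like the map step."""
    return [(word, 1) for word in text.split(' ')]


def reduce_counts(values):
    """Sum the partial counts collected for one word, like the reduce step."""
    return sum(values)


def word_frequencies(texts):
    """Group the emitted pairs by word, then reduce each group to a total."""
    grouped = defaultdict(list)
    for text in texts:
        for word, count in map_words(text):
            grouped[word].append(count)
    return {word: reduce_counts(counts) for word, counts in grouped.items()}


if __name__ == '__main__':
    # Hypothetical sample tweets, not real data.
    sample = ["#Bieber fever #500aday", "#Bieber #AutoFollow"]
    for word, count in sorted(word_frequencies(sample).items()):
        print(word, count)
```

Like the MongoDB result, this is case-sensitive, so `#500ADay` and `#500aday` are counted as different keys. Lowercasing each word before emitting it would merge them.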
    # Print every document in the map-reduce output collection "mystring".
    from pymongo.connection import Connection

    connection = Connection("localhost")
    db = connection.foo

    cursor = db.mystring.find()
    for d in cursor:
        print(d)
Get From MongoDB
What Entities Co-Occur Most Often with #JustinBieber and #TeaParty Tweets?
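    # Compare the first token of each line in two files and report the overlap.
    import sys


    def first_tokens(path):
        """Return the set of first whitespace-separated tokens, one per non-empty line."""
        tokens = set()
        with open(path) as f:
            for line in f:
                key = line.split()
                if len(key) > 0:
                    tokens.add(key[0])
        return tokens


    if __name__ == '__main__':
        s1 = first_tokens(sys.argv[1])
        s2 = first_tokens(sys.argv[2])
        s3 = s1.intersection(s2)
        print(len(s1))  # distinct tokens in the first file
        print(len(s2))  # distinct tokens in the second file
        print(len(s3))  # tokens present in both files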
intersection
On Average, Do #JustinBieber or #TeaParty Tweets Have More Hashtags?
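One way to approach this question is to count hashtags in each tweet's text and average over the collection. The sketch below is an assumption, not the deck's own method: the `#\w+` pattern is a rough approximation of Twitter's hashtag rules, and `average_hashtags` and the sample texts are hypothetical.

```python
import re

# Rough approximation: '#' followed by one or more word characters.
HASHTAG_RE = re.compile(r'#\w+')


def average_hashtags(texts):
    """Return the mean number of hashtags per tweet text (0.0 if there are none)."""
    texts = list(texts)
    if not texts:
        return 0.0
    total = sum(len(HASHTAG_RE.findall(text)) for text in texts)
    return total / float(len(texts))


if __name__ == '__main__':
    # Hypothetical sample texts, not real data.
    print(average_hashtags(["#a #b", "no tags here", "#c"]))
```

Running it once over the `text` field of each stored collection would give one average per hashtag to compare.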
Which Get Retweeted More Often: #JustinBieber or #TeaParty?
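A simple heuristic for this question is to treat a tweet as a retweet when its text contains "RT @user" or "via @user", then compare the share of such tweets in each collection. This is a sketch under that assumption: it relies on text conventions rather than retweet data from the API, and `RETWEET_RE` and `retweet_fraction` are hypothetical names.

```python
import re

# Text-convention heuristic: "RT @user" or "via @user", case-insensitive.
RETWEET_RE = re.compile(r'\b(?:RT|via)\b\s*@\w+', re.IGNORECASE)


def retweet_fraction(texts):
    """Return the fraction of tweet texts that look like retweets."""
    texts = list(texts)
    if not texts:
        return 0.0
    retweets = sum(1 for text in texts if RETWEET_RE.search(text))
    return retweets / float(len(texts))


if __name__ == '__main__':
    # Hypothetical sample texts, not real data.
    sample = ["RT @a: hi", "hello", "nice show via @b", "ART @c"]
    print(retweet_fraction(sample))
```

The word boundary keeps words that merely end in "rt" (such as "ART @c") from being counted.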
How Much Overlap Exists Between the Entities of #TeaParty and #JustinBieber Tweets?
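Beyond the raw intersection size, the overlap can be expressed as a ratio. The sketch below uses the Jaccard index (shared entities divided by all distinct entities), which is a swapped-in measure and not something the deck specifies; `jaccard` and the sample entity lists are hypothetical.

```python
def jaccard(entities_a, entities_b):
    """Return |A & B| / |A | B| for two collections of entities (0.0 if both are empty)."""
    a, b = set(entities_a), set(entities_b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / float(len(union))


if __name__ == '__main__':
    # Hypothetical entity lists, not real data.
    bieber = ["#Bieber", "#AutoFollow", "#500aday"]
    teaparty = ["#tcot", "#AutoFollow"]
    print(jaccard(bieber, teaparty))
```

A value near 0 means the two sets of entities share almost nothing, and 1 means they are identical.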
Thank You!