Mining social data
![Page 1: Mining social data](https://reader034.fdocuments.us/reader034/viewer/2022051212/55878176d8b42abd4c8b470a/html5/thumbnails/1.jpg)
Mining Social Data
FOSDEM 2013
![Page 2: Mining social data](https://reader034.fdocuments.us/reader034/viewer/2022051212/55878176d8b42abd4c8b470a/html5/thumbnails/2.jpg)
Credits
Speaker: Romeu "@malk_zameth" MOURA
Company: @linagora
License: CC-BY-SA 3.0
SlideShare: j.mp/XXgBAn
Sources:
● Mining Graph Data
● Mining the Social Web
● Social Network Analysis for startups
● Social Media Mining and Social Network Analysis
● Graph Mining
![Page 4: Mining social data](https://reader034.fdocuments.us/reader034/viewer/2022051212/55878176d8b42abd4c8b470a/html5/thumbnails/4.jpg)
EloData & OpenGraphMiner
Linagora's foray into ESN, DataStorage, Graphs & Mining.
![Page 5: Mining social data](https://reader034.fdocuments.us/reader034/viewer/2022051212/55878176d8b42abd4c8b470a/html5/thumbnails/5.jpg)
Why mine social data at all?
Without being a creepy stalker
![Page 6: Mining social data](https://reader034.fdocuments.us/reader034/viewer/2022051212/55878176d8b42abd4c8b470a/html5/thumbnails/6.jpg)
To see what humans can't.
Influence, centers of interest.
![Page 7: Mining social data](https://reader034.fdocuments.us/reader034/viewer/2022051212/55878176d8b42abd4c8b470a/html5/thumbnails/7.jpg)
To remember what humans can't.
What worked in the past? Objectively, how have I behaved until now?
![Page 8: Mining social data](https://reader034.fdocuments.us/reader034/viewer/2022051212/55878176d8b42abd4c8b470a/html5/thumbnails/8.jpg)
To discover what humans won't.
![Page 9: Mining social data](https://reader034.fdocuments.us/reader034/viewer/2022051212/55878176d8b42abd4c8b470a/html5/thumbnails/9.jpg)
Serendipity
Find what you were not looking for
![Page 10: Mining social data](https://reader034.fdocuments.us/reader034/viewer/2022051212/55878176d8b42abd4c8b470a/html5/thumbnails/10.jpg)
Real-life social data
What is so specific about it?
![Page 11: Mining social data](https://reader034.fdocuments.us/reader034/viewer/2022051212/55878176d8b42abd4c8b470a/html5/thumbnails/11.jpg)
Always graphs
![Page 12: Mining social data](https://reader034.fdocuments.us/reader034/viewer/2022051212/55878176d8b42abd4c8b470a/html5/thumbnails/12.jpg)
Dense substructures
Every vertex is a unique entity (someone).
Several dense subgraphs: relations within pockets of people.
![Page 13: Mining social data](https://reader034.fdocuments.us/reader034/viewer/2022051212/55878176d8b42abd4c8b470a/html5/thumbnails/13.jpg)
Usually it has no good cuts
Even the best partition algorithms cannot find partitions that are just not there.
![Page 14: Mining social data](https://reader034.fdocuments.us/reader034/viewer/2022051212/55878176d8b42abd4c8b470a/html5/thumbnails/14.jpg)
There will be errors & unknowns
Exact matching is not an option
![Page 15: Mining social data](https://reader034.fdocuments.us/reader034/viewer/2022051212/55878176d8b42abd4c8b470a/html5/thumbnails/15.jpg)
Plenty of vanity metrics pollution.
Sometimes very surprising ones.
![Page 16: Mining social data](https://reader034.fdocuments.us/reader034/viewer/2022051212/55878176d8b42abd4c8b470a/html5/thumbnails/16.jpg)
Number of followers is a vanity metric
@GuyKawasaki (~1.5M followers) is retweeted much more than the user with the most followers (@justinbieber, ~34M)
![Page 17: Mining social data](https://reader034.fdocuments.us/reader034/viewer/2022051212/55878176d8b42abd4c8b470a/html5/thumbnails/17.jpg)
Why use graphs?
What is the itch with Inductive Logic that Inductive Graphs scratch?
![Page 18: Mining social data](https://reader034.fdocuments.us/reader034/viewer/2022051212/55878176d8b42abd4c8b470a/html5/thumbnails/18.jpg)
'Classic' Data Mining
Pros and cons
![Page 19: Mining social data](https://reader034.fdocuments.us/reader034/viewer/2022051212/55878176d8b42abd4c8b470a/html5/thumbnails/19.jpg)
pro: Solid, well-known techniques
with good performance
![Page 20: Mining social data](https://reader034.fdocuments.us/reader034/viewer/2022051212/55878176d8b42abd4c8b470a/html5/thumbnails/20.jpg)
con: Complex structures are translated
into Bayesian networks or multi-relational tables, incurring either data loss or combinatorial explosion.
![Page 21: Mining social data](https://reader034.fdocuments.us/reader034/viewer/2022051212/55878176d8b42abd4c8b470a/html5/thumbnails/21.jpg)
Graph Mining
'The new deal'
![Page 22: Mining social data](https://reader034.fdocuments.us/reader034/viewer/2022051212/55878176d8b42abd4c8b470a/html5/thumbnails/22.jpg)
pro: Expressiveness and simplicity
The input and output are graphs, no conversions, graph algorithms all around.
![Page 23: Mining social data](https://reader034.fdocuments.us/reader034/viewer/2022051212/55878176d8b42abd4c8b470a/html5/thumbnails/23.jpg)
con: The unit of operation is comparing isomorphisms
Subgraph isomorphism is NP-complete
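To make the cost concrete, here is a minimal brute-force sketch of subgraph matching. It is not taken from the talk: the function name and the edge-list representation are assumptions chosen for illustration, and it checks plain (non-induced) containment of an undirected pattern. It tries every injective mapping of pattern nodes onto graph nodes, so the number of candidates grows as n!/(n-k)! for a pattern of k nodes in a graph of n nodes.

```python
from itertools import permutations


def is_subgraph_isomorphic(pattern_edges, graph_edges):
    """Return True if the undirected pattern occurs somewhere in the graph."""
    pattern = {frozenset(e) for e in pattern_edges}
    graph = {frozenset(e) for e in graph_edges}
    p_nodes = sorted({n for e in pattern for n in e})
    g_nodes = sorted({n for e in graph for n in e})
    # Try every injective mapping of pattern nodes onto graph nodes.
    for image in permutations(g_nodes, len(p_nodes)):
        mapping = dict(zip(p_nodes, image))
        # The mapping is a match if every pattern edge lands on a graph edge.
        if all(frozenset(mapping[n] for n in e) in graph for e in pattern):
            return True
    return False


# Made-up toy graphs, only to exercise the function.
triangle = [("a", "b"), ("b", "c"), ("c", "a")]
square = [(1, 2), (2, 3), (3, 4), (4, 1)]
square_with_diagonal = square + [(1, 3)]

print(is_subgraph_isomorphic(triangle, square))                # False
print(is_subgraph_isomorphic(triangle, square_with_diagonal))  # True
```

Real graph miners avoid this exhaustive search with pruning and heuristics, but the worst case remains expensive.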
![Page 24: Mining social data](https://reader034.fdocuments.us/reader034/viewer/2022051212/55878176d8b42abd4c8b470a/html5/thumbnails/24.jpg)
Extraction
Getting the data
![Page 25: Mining social data](https://reader034.fdocuments.us/reader034/viewer/2022051212/55878176d8b42abd4c8b470a/html5/thumbnails/25.jpg)
Is the easy part
A commodity, really.
![Page 26: Mining social data](https://reader034.fdocuments.us/reader034/viewer/2022051212/55878176d8b42abd4c8b470a/html5/thumbnails/26.jpg)
Social networks provide APIs
Facebook Graph API, Twitter REST API, Yammer API, etc.
![Page 27: Mining social data](https://reader034.fdocuments.us/reader034/viewer/2022051212/55878176d8b42abd4c8b470a/html5/thumbnails/27.jpg)
Worst case: Crawl the website
Crawling The Web For Fun And Profit: http://youtu.be/eQtxbaw__W8
![Page 28: Mining social data](https://reader034.fdocuments.us/reader034/viewer/2022051212/55878176d8b42abd4c8b470a/html5/thumbnails/28.jpg)

    # Python 2 listing; uses the 2013-era Twitter Search API and networkx 1.x.
    import sys
    import json
    import twitter
    import networkx as nx
    from recipe__get_rt_origins import get_rt_origins

    def create_rt_graph(tweets):
        # Directed graph: an edge runs from the retweeted user to the retweeter.
        g = nx.DiGraph()
        for tweet in tweets:
            rt_origins = get_rt_origins(tweet)
            if not rt_origins:
                continue
            for rt_origin in rt_origins:
                g.add_edge(rt_origin.encode('ascii', 'ignore'),
                           tweet['from_user'].encode('ascii', 'ignore'),
                           {'tweet_id': tweet['id']})
        return g

    if __name__ == '__main__':
        # Join all command-line arguments into one search query.
        Q = ' '.join(sys.argv[1:])
        MAX_PAGES = 15
        RESULTS_PER_PAGE = 100
        twitter_search = twitter.Twitter(domain='search.twitter.com')
        # Fetch up to MAX_PAGES pages of search results.
        search_results = []
        for page in range(1, MAX_PAGES + 1):
            search_results.append(
                twitter_search.search(q=Q, rpp=RESULTS_PER_PAGE, page=page))
        all_tweets = [tweet for page in search_results for tweet in page['results']]
        g = create_rt_graph(all_tweets)
        # Basic statistics about the retweet graph.
        print >> sys.stderr, "Number nodes:", g.number_of_nodes()
        print >> sys.stderr, "Num edges:", g.number_of_edges()
        print >> sys.stderr, "Num connected components:", len(nx.connected_components(g.to_undirected()))
        print >> sys.stderr, "Node degrees:", sorted(nx.degree(g))

https://github.com/ptwobrussell/Recipes-for-Mining-Twitter
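The same idea can be tried without any network access. The sketch below is not the recipe's code: it uses only the standard library, and the `text` field, the `RT @user` / `via @user` convention and the helper names are assumptions made for illustration. It builds the retweet graph as a plain dictionary from a few made-up tweets.

```python
import re
from collections import defaultdict

# Assumed convention: a retweet credits its source as "RT @user" or "via @user".
RT_PATTERN = re.compile(r"(?:\bRT|\bvia)\s+@(\w+)", re.IGNORECASE)


def rt_origins(text):
    """Return the user names that a tweet text credits as retweet origins."""
    return [name.lower() for name in RT_PATTERN.findall(text)]


def build_rt_graph(tweets):
    """Map each retweeted user to the set of users who retweeted them."""
    graph = defaultdict(set)
    for tweet in tweets:
        for origin in rt_origins(tweet["text"]):
            graph[origin].add(tweet["from_user"].lower())
    return graph


# Made-up tweets, only to exercise the code.
sample = [
    {"from_user": "bob", "text": "RT @alice: graphs everywhere"},
    {"from_user": "carol", "text": "RT @alice: graphs everywhere"},
    {"from_user": "dave", "text": "Nice talk (via @bob)"},
    {"from_user": "erin", "text": "no retweet here"},
]
graph = build_rt_graph(sample)
# Out-degree = how many distinct users retweeted each origin.
out_degree = {user: len(retweeters) for user, retweeters in graph.items()}
print(out_degree)
```

Counting distinct retweeters per origin is a first, rough stand-in for influence; it is the kind of signal that follower counts alone do not give.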
![Page 29: Mining social data](https://reader034.fdocuments.us/reader034/viewer/2022051212/55878176d8b42abd4c8b470a/html5/thumbnails/29.jpg)
Finding patterns
Substructures that repeat
![Page 30: Mining social data](https://reader034.fdocuments.us/reader034/viewer/2022051212/55878176d8b42abd4c8b470a/html5/thumbnails/30.jpg)
Older options
Apriori-based, pattern growth
![Page 31: Mining social data](https://reader034.fdocuments.us/reader034/viewer/2022051212/55878176d8b42abd4c8b470a/html5/thumbnails/31.jpg)
Stepwise pair expansion
Separate the graph by pairs, count frequencies, keep the most frequent, augment them by one, repeat.
![Page 32: Mining social data](https://reader034.fdocuments.us/reader034/viewer/2022051212/55878176d8b42abd4c8b470a/html5/thumbnails/32.jpg)
"Chunk": Separate the graph by pairs
![Page 33: Mining social data](https://reader034.fdocuments.us/reader034/viewer/2022051212/55878176d8b42abd4c8b470a/html5/thumbnails/33.jpg)
Keep only the frequent ones
![Page 34: Mining social data](https://reader034.fdocuments.us/reader034/viewer/2022051212/55878176d8b42abd4c8b470a/html5/thumbnails/34.jpg)
Expand them
![Page 35: Mining social data](https://reader034.fdocuments.us/reader034/viewer/2022051212/55878176d8b42abd4c8b470a/html5/thumbnails/35.jpg)
Find your frequent pattern
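One round of the steps above can be sketched as follows. This is a simplified illustration under stated assumptions, not the talk's implementation: nodes carry a single label, edges are undirected, and the function names are hypothetical. The most frequent labeled pair is merged ("chunked") into pseudo-nodes, and each node can be absorbed by one occurrence only.

```python
from collections import Counter


def pair_counts(labels, edges):
    """Count how often each unordered pair of labels is joined by an edge."""
    return Counter(tuple(sorted((labels[u], labels[v]))) for u, v in edges)


def chunk_most_frequent(labels, edges):
    """One round: merge occurrences of the most frequent pair into pseudo-nodes."""
    best, _ = pair_counts(labels, edges).most_common(1)[0]
    merged_into = {}                  # original node -> pseudo-node absorbing it
    new_labels = dict(labels)
    for u, v in edges:
        if u in merged_into or v in merged_into:
            continue                  # a node can be chunked only once
        if tuple(sorted((labels[u], labels[v]))) == best:
            pseudo = (u, v)
            merged_into[u] = merged_into[v] = pseudo
            new_labels[pseudo] = "(%s+%s)" % best
            del new_labels[u], new_labels[v]
    new_edges = set()
    for u, v in edges:
        a, b = merged_into.get(u, u), merged_into.get(v, v)
        if a != b:                    # drop edges swallowed by a pseudo-node
            new_edges.add(tuple(sorted((a, b), key=str)))
    return best, new_labels, sorted(new_edges, key=str)


# Made-up toy graph, only to exercise the code.
labels = {1: "dev", 2: "dev", 3: "ops", 4: "dev", 5: "ops", 6: "sales"}
edges = [(1, 3), (2, 3), (4, 5), (1, 2), (5, 6)]
best, new_labels, new_edges = chunk_most_frequent(labels, edges)
print(best, new_labels, new_edges)
```

In this toy graph the pair `("dev", "ops")` occurs three times, but only two occurrences are chunked: node 3 is already absorbed by the pseudo-node `(1, 3)`, so the occurrence `(2, 3)` is no longer visible in later rounds.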
![Page 36: Mining social data](https://reader034.fdocuments.us/reader034/viewer/2022051212/55878176d8b42abd4c8b470a/html5/thumbnails/36.jpg)
con: Chunkiness
![Page 37: Mining social data](https://reader034.fdocuments.us/reader034/viewer/2022051212/55878176d8b42abd4c8b470a/html5/thumbnails/37.jpg)
"ChunkingLess"Graph Based Induction
CL-CBI [Cook et. al.]
![Page 38: Mining social data](https://reader034.fdocuments.us/reader034/viewer/2022051212/55878176d8b42abd4c8b470a/html5/thumbnails/38.jpg)
Inputs needed
1. Minimal frequency at which we consider a configuration to be a pattern: threshold
2. Number of most frequent patterns we will retain: beam size
3. Arbitrary number of times we will iterate: levels
![Page 39: Mining social data](https://reader034.fdocuments.us/reader034/viewer/2022051212/55878176d8b42abd4c8b470a/html5/thumbnails/39.jpg)
1. "Chunk": Separate the graph by pairs
![Page 40: Mining social data](https://reader034.fdocuments.us/reader034/viewer/2022051212/55878176d8b42abd4c8b470a/html5/thumbnails/40.jpg)
2. Select beam-size most frequent ones
![Page 41: Mining social data](https://reader034.fdocuments.us/reader034/viewer/2022051212/55878176d8b42abd4c8b470a/html5/thumbnails/41.jpg)
3. Turn selected pairs into pseudo-nodes
![Page 42: Mining social data](https://reader034.fdocuments.us/reader034/viewer/2022051212/55878176d8b42abd4c8b470a/html5/thumbnails/42.jpg)
4. Expand & Rechunk
![Page 43: Mining social data](https://reader034.fdocuments.us/reader034/viewer/2022051212/55878176d8b42abd4c8b470a/html5/thumbnails/43.jpg)
Keep going back to step 2
Until you have done it levels times.
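The loop above can be sketched roughly as below. This is a heavily simplified illustration, not the published algorithm: the function names are hypothetical, a pseudo-node is just a set of original nodes, it is only paired with adjacent original nodes, and patterns are compared through a label-pair signature that is an invariant rather than a true canonical form (so non-isomorphic patterns could collide on larger inputs). The point it tries to show is that selected pairs are registered as pseudo-nodes without rewriting the graph, so overlapping occurrences survive.

```python
from collections import defaultdict


def signature(occurrence, labels, edges):
    """Sorted label pairs of the edges inside an occurrence (not a canonical form)."""
    return tuple(sorted(
        tuple(sorted((labels[u], labels[v])))
        for u, v in edges if u in occurrence and v in occurrence
    ))


def cl_gbi_sketch(labels, edges, threshold, beam_size, levels):
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    # Start from single nodes; the graph itself is never rewritten.
    frontier = {frozenset([n]) for n in labels}
    patterns = {}
    for _ in range(levels):
        # Chunk: pair every frontier (pseudo-)node with each adjacent node.
        occurrences = defaultdict(set)
        for occ in frontier:
            for n in occ:
                for m in adjacency[n]:
                    if m not in occ:
                        grown = occ | {m}
                        occurrences[signature(grown, labels, edges)].add(grown)
        # Select the beam_size most frequent pairs (ties broken by signature).
        ranked = sorted(occurrences.items(), key=lambda kv: (-len(kv[1]), kv[0]))
        frontier = set()
        for sig, occs in ranked[:beam_size]:
            frontier |= occs          # register as pseudo-nodes, overlaps allowed
            if len(occs) >= threshold:
                patterns[sig] = len(occs)
    return patterns


# Made-up toy graph: two dev-ops-sales chains joined by a sales-dev edge.
labels = {1: "dev", 2: "ops", 3: "sales", 4: "dev", 5: "ops", 6: "sales"}
edges = [(1, 2), (2, 3), (4, 5), (5, 6), (3, 4)]
found = cl_gbi_sketch(labels, edges, threshold=2, beam_size=2, levels=2)
print(found)
```

Here the occurrences `{1, 2}` and `{2, 3}` share node 2 and are both kept, which is what lets the second level find the dev-ops-sales chain twice.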
![Page 44: Mining social data](https://reader034.fdocuments.us/reader034/viewer/2022051212/55878176d8b42abd4c8b470a/html5/thumbnails/44.jpg)
Decision Trees
![Page 45: Mining social data](https://reader034.fdocuments.us/reader034/viewer/2022051212/55878176d8b42abd4c8b470a/html5/thumbnails/45.jpg)
A tree of patterns
Finding a pattern on a branch yields a decision
![Page 46: Mining social data](https://reader034.fdocuments.us/reader034/viewer/2022051212/55878176d8b42abd4c8b470a/html5/thumbnails/46.jpg)
DT-CLGBI
![Page 47: Mining social data](https://reader034.fdocuments.us/reader034/viewer/2022051212/55878176d8b42abd4c8b470a/html5/thumbnails/47.jpg)

    DT-CLGBI(graph: D)
    begin
        create_node DT in D
        if threshold-attained
            return DT
        else
            P <- select_most_discriminative(CL-GBI(D))
            (Dy, Dn) <- branch_DT_on_predicate(P)
            for Di <- Dy
                DT.branch_yes.add-child(DT-CLGBI(Di))
            for Di <- Dn
                DT.branch_no.add-child(DT-CLGBI(Di))
    end

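The recursion in the pseudocode can be illustrated with a small sketch. Several things here are assumptions, not the talk's method: each example is a graph reduced to a set of labeled pairs plus a class label, the pattern miner is replaced by a stand-in that proposes every labeled pair seen in the data, "most discriminative" is taken to mean highest information gain, and the stopping rule is class purity. All names are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(examples):
    """Shannon entropy of the class labels in a list of (graph, class) pairs."""
    counts = Counter(cls for _, cls in examples)
    total = len(examples)
    return -sum(c / total * log2(c / total) for c in counts.values())


def candidate_patterns(examples):
    """Stand-in for the pattern miner: every labeled pair seen in the data."""
    return sorted({pair for graph, _ in examples for pair in graph})


def build_tree(examples):
    classes = Counter(cls for _, cls in examples)
    majority = classes.most_common(1)[0][0]
    if len(classes) == 1:             # stopping rule assumed here: pure node
        return majority
    best, best_gain = None, 0.0
    for pattern in candidate_patterns(examples):
        yes = [ex for ex in examples if pattern in ex[0]]
        no = [ex for ex in examples if pattern not in ex[0]]
        if not yes or not no:
            continue                  # pattern does not split the data
        remainder = (len(yes) * entropy(yes) + len(no) * entropy(no)) / len(examples)
        gain = entropy(examples) - remainder
        if gain > best_gain:
            best, best_gain = pattern, gain
    if best is None:
        return majority
    yes = [ex for ex in examples if best in ex[0]]
    no = [ex for ex in examples if best not in ex[0]]
    return {"pattern": best, "yes": build_tree(yes), "no": build_tree(no)}


def classify(tree, graph):
    """Walk the tree: follow 'yes' when the graph contains the node's pattern."""
    while isinstance(tree, dict):
        tree = tree["yes"] if tree["pattern"] in graph else tree["no"]
    return tree


# Made-up training data, only to exercise the code.
data = [
    ({("dev", "ops"), ("dev", "dev")}, "engineering"),
    ({("dev", "ops"), ("ops", "sales")}, "engineering"),
    ({("ops", "sales"), ("sales", "sales")}, "business"),
    ({("sales", "sales")}, "business"),
]
tree = build_tree(data)
print(tree)
print(classify(tree, {("dev", "ops"), ("sales", "sales")}))
```

On this toy data a single test on the `("dev", "ops")` pair separates the two classes, so the tree has one internal node and two leaves.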