Mining social data

Mining Social Data FOSDEM 2013

description

FOSDEM 2013 presentation on techniques for mining the social web using graphs.

Transcript of Mining social data

Page 1: Mining social data

Mining Social Data

FOSDEM 2013

Page 3: Mining social data

I work at Linagora, a French FLOSS company.

Page 4: Mining social data

EloData & OpenGraphMiner

Linagora's foray into ESN, DataStorage, Graphs & Mining.

Page 5: Mining social data

Why mine social data at all?

Without being a creepy stalker

Page 6: Mining social data

To see what humans can't.

Influence, centers of interest.

Page 7: Mining social data

To remember what humans can't.

What worked in the past? Objectively, how did I behave until now?

Page 8: Mining social data

To discover what humans won't.

Page 9: Mining social data

Serendipity

Find what you were not looking for

Page 10: Mining social data

Real-life social data

What is so specific about it?

Page 11: Mining social data

Always graphs

Page 12: Mining social data

Dense substructures

Every vertex is a unique entity (someone).

Several dense subgraphs: relations within pockets of people

Page 13: Mining social data

Usually it has no good cuts

Even the best partitioning algorithms cannot find partitions that are just not there

Page 14: Mining social data

There will be errors & unknowns

Exact matching is not an option

Page 15: Mining social data

Plenty of vanity metrics pollution.

Sometimes very surprising ones.

Page 16: Mining social data

Number of followers is a vanity metric

@GuyKawasaki (~1.5M followers) is retweeted much more than the user with the most followers (@justinbieber, ~34M)

Page 17: Mining social data

Why use graphs?

What is the itch with Inductive Logic that Inductive Graphs scratch?

Page 18: Mining social data

'Classic' Data Mining

Pros and cons

Page 19: Mining social data

pro: Solid, well-known techniques

with good performance

Page 20: Mining social data

con: Complex structures are translated

into Bayesian networks or multi-relational tables, incurring either data loss or combinatorial explosion.

Page 21: Mining social data

Graph Mining

'The new deal'

Page 22: Mining social data

pro: Expressiveness and simplicity

The input and output are graphs, no conversions, graph algorithms all around.

Page 23: Mining social data

con: The unit of operation is testing for subgraph isomorphism

NP-complete
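To make that unit of operation concrete, here is a minimal sketch that asks whether a small pattern occurs inside a larger graph. It assumes networkx and its VF2-based GraphMatcher; the toy graph, the user names and the helper contains_pattern are illustrative assumptions, not part of the talk.

```python
import networkx as nx
from networkx.algorithms import isomorphism

# Toy social graph (hypothetical data): a triangle of friends plus one outsider.
social = nx.Graph([("ann", "bob"), ("bob", "cid"), ("cid", "ann"), ("cid", "dan")])

# Patterns to look for: a triangle and a 4-cycle.
triangle = nx.cycle_graph(3)
square = nx.cycle_graph(4)


def contains_pattern(graph, pattern):
    """Return True if `pattern` matches a node-induced subgraph of `graph`."""
    matcher = isomorphism.GraphMatcher(graph, pattern)
    return matcher.subgraph_is_isomorphic()


has_triangle = contains_pattern(social, triangle)
has_square = contains_pattern(social, square)
```

Every frequency count in a graph miner boils down to many such checks, which is why the cost matters.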

Page 24: Mining social data

Extraction

Getting the data

Page 25: Mining social data

Is the easy part

A commodity really.

Page 26: Mining social data

Social networks provide APIs

Facebook Graph API, Twitter REST API, Yammer API, etc.

Page 27: Mining social data

Worst case: crawl the website

Crawling The Web For Fun And Profit: http://youtu.be/eQtxbaw__W8

Page 28: Mining social data

    import sys

    import twitter
    import networkx as nx
    from recipe__get_rt_origins import get_rt_origins


    def create_rt_graph(tweets):
        """Build a directed graph with one edge per retweet: origin -> retweeter."""
        g = nx.DiGraph()
        for tweet in tweets:
            rt_origins = get_rt_origins(tweet)
            if not rt_origins:
                continue
            for rt_origin in rt_origins:
                # Keep the tweet id as an edge attribute.
                g.add_edge(rt_origin, tweet['from_user'], tweet_id=tweet['id'])
        return g


    if __name__ == '__main__':
        # Join all command-line arguments into a single search query.
        Q = ' '.join(sys.argv[1:])
        MAX_PAGES = 15
        RESULTS_PER_PAGE = 100

        # Search endpoint as it existed at the time of the talk (2013).
        twitter_search = twitter.Twitter(domain='search.twitter.com')
        search_results = []
        for page in range(1, MAX_PAGES + 1):
            search_results.append(
                twitter_search.search(q=Q, rpp=RESULTS_PER_PAGE, page=page))

        all_tweets = [tweet for page in search_results for tweet in page['results']]
        g = create_rt_graph(all_tweets)

        # Summary statistics go to stderr.
        print("Number nodes:", g.number_of_nodes(), file=sys.stderr)
        print("Num edges:", g.number_of_edges(), file=sys.stderr)
        print("Num connected components:",
              nx.number_connected_components(g.to_undirected()), file=sys.stderr)
        print("Node degrees:", sorted(d for _, d in g.degree()), file=sys.stderr)

https://github.com/ptwobrussell/Recipes-for-Mining-Twitter
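Once a retweet graph of this shape exists (edges pointing from the original author to the retweeter), a less vain metric than follower count is easy to read off it. The sketch below ranks users by out-degree, that is, by how many distinct users retweeted them. The edge list and the helper most_retweeted are made-up illustrations, not data or code from the talk.

```python
import networkx as nx

# Hypothetical retweet edges: (original author, retweeter).
rt = nx.DiGraph()
rt.add_edges_from([
    ("alice", "bob"),
    ("alice", "carol"),
    ("alice", "dave"),
    ("erin", "bob"),
])


def most_retweeted(graph, top=3):
    """Rank users by the number of distinct users who retweeted them."""
    # Sort by descending out-degree, then by name to break ties deterministically.
    ranked = sorted(graph.out_degree(), key=lambda item: (-item[1], item[0]))
    return ranked[:top]


top_users = most_retweeted(rt, top=2)
```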

Page 29: Mining social data

Finding patterns

Substructures that repeat

Page 30: Mining social data

Older options

Apriori-based, pattern growth

Page 31: Mining social data

Stepwise pair expansion

Separate the graph by pairs, count frequencies, keep the most frequent, augment them by one, repeat.
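As a rough illustration of the first two moves (chunk into pairs, keep the frequent ones), the sketch below counts how often each pair of node labels appears on an edge. It assumes networkx and a node attribute called "label"; the roles, the threshold and the function names are illustrative assumptions.

```python
from collections import Counter

import networkx as nx


def count_label_pairs(graph):
    """Count how often each unordered (label, label) pair appears on an edge."""
    counts = Counter()
    for u, v in graph.edges():
        pair = tuple(sorted((graph.nodes[u]["label"], graph.nodes[v]["label"])))
        counts[pair] += 1
    return counts


def frequent_pairs(graph, threshold):
    """Keep only the pairs seen at least `threshold` times."""
    return {p: n for p, n in count_label_pairs(graph).items() if n >= threshold}


# Hypothetical graph: people labelled by role.
g = nx.Graph()
g.add_nodes_from([
    (1, {"label": "dev"}), (2, {"label": "dev"}),
    (3, {"label": "ops"}), (4, {"label": "ops"}),
    (5, {"label": "pm"}),
])
g.add_edges_from([(1, 3), (2, 4), (1, 2), (3, 5), (2, 3)])

pair_counts = count_label_pairs(g)
kept = frequent_pairs(g, threshold=2)
```

Here only the dev/ops pair survives the threshold; it is the one that would then be expanded by one more node.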

Page 32: Mining social data

"Chunk": Separate the graph by pairs

Page 33: Mining social data

Keep only the frequent ones

Page 34: Mining social data

Expand them

Page 35: Mining social data

Find your frequent pattern

Page 36: Mining social data

con: Chunkiness

Page 37: Mining social data

"Chunkingless" Graph-Based Induction

CL-GBI [Cook et al.]

Page 38: Mining social data

Inputs needed

1. Minimum frequency at which we consider a substructure to be a pattern: threshold

2. Number of most frequent patterns we will retain: beam size

3. Arbitrary number of times we will iterate: levels

Page 39: Mining social data

1. "Chunk": Separate the graph by pairs

Page 40: Mining social data

2. Select beam-size most frequent ones

Page 41: Mining social data

3. Turn selected pairs into pseudo-nodes

Page 42: Mining social data

4. Expand & Rechunk

Page 43: Mining social data

Keep going back to step 2

until you have done it levels times.
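The loop in steps 1 to 4 can be sketched roughly as follows. This is a deliberately simplified illustration, not the published algorithm. A pattern is identified only by the nested pair of labels it was built from, so which inner nodes are connected is ignored. Selected pairs are registered as pseudo-nodes next to the existing units instead of replacing them. The function name, the "label" attribute and the toy chain are all assumptions.

```python
from collections import defaultdict
from itertools import combinations

import networkx as nx


def pair_expansion_sketch(graph, threshold, beam_size, levels):
    """Return {pattern: frequency} found by repeated pair expansion."""
    # A unit is (pattern, frozenset of original nodes); start with single nodes.
    units = {(graph.nodes[n]["label"], frozenset([n])) for n in graph}
    found = {}
    for _ in range(levels):
        # 1. "Chunk": pair up disjoint units joined by at least one edge.
        instances = defaultdict(set)
        for (pa, na), (pb, nb) in combinations(units, 2):
            if na & nb:
                continue
            if any(graph.has_edge(u, v) for u in na for v in nb):
                pattern = tuple(sorted((pa, pb), key=repr))
                instances[pattern].add(na | nb)
        # 2. Select the beam_size most frequent new patterns above the threshold.
        candidates = [(p, occ) for p, occ in instances.items()
                      if p not in found and len(occ) >= threshold]
        candidates.sort(key=lambda item: (-len(item[1]), repr(item[0])))
        # 3. Turn them into pseudo-nodes; existing units stay available, so
        #    4. the next pass expands and re-chunks around the new pseudo-nodes.
        for pattern, occurrences in candidates[:beam_size]:
            found[pattern] = len(occurrences)
            units |= {(pattern, nodes) for nodes in occurrences}
    return found


# Hypothetical input: a chain a-b-a-b-a-b.
chain = nx.path_graph(6)
for n in chain:
    chain.nodes[n]["label"] = "a" if n % 2 == 0 else "b"

patterns = pair_expansion_sketch(chain, threshold=2, beam_size=1, levels=2)
```

On this chain the first level keeps the a-b pair (5 occurrences) and the second level keeps the pair of two a-b pseudo-nodes (3 occurrences).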

Page 44: Mining social data

Decision Trees

Page 45: Mining social data

A tree of patterns

Finding a pattern on a branch yields a decision
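A minimal sketch of how such a tree could be used: each inner node holds a pattern graph, and a graph is sent down the yes or no branch depending on whether it contains that pattern. It assumes networkx's GraphMatcher for the containment test; the PatternNode class, the patterns and the decision names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

import networkx as nx
from networkx.algorithms import isomorphism


@dataclass
class PatternNode:
    pattern: Optional[nx.Graph] = None   # pattern to test (inner nodes only)
    decision: Optional[str] = None       # outcome (leaves only)
    yes: Optional["PatternNode"] = None  # branch taken if the pattern is found
    no: Optional["PatternNode"] = None   # branch taken otherwise


def classify(tree, graph):
    """Walk the tree, testing one pattern per inner node, until a leaf."""
    node = tree
    while node.decision is None:
        matcher = isomorphism.GraphMatcher(graph, node.pattern)
        node = node.yes if matcher.subgraph_is_isomorphic() else node.no
    return node.decision


# Hypothetical tree: triangle first, then a star with three leaves.
tree = PatternNode(
    pattern=nx.cycle_graph(3),
    yes=PatternNode(decision="tight group"),
    no=PatternNode(
        pattern=nx.star_graph(3),
        yes=PatternNode(decision="hub"),
        no=PatternNode(decision="loose"),
    ),
)

result_triangle = classify(tree, nx.complete_graph(3))
result_star = classify(tree, nx.star_graph(3))
result_path = classify(tree, nx.path_graph(3))
```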

Page 46: Mining social data

DT-CLGBI

Page 47: Mining social data

DT-CLGBI(graph: D)
begin
    create_node DT in D
    if threshold-attained
        return DT
    else
        P <- select_most_discriminative(CL-GBI(D))
        (Dy, Dn) <- branch_DT_on_predicate(P)
        for Di <- Dy
            DT.branch_yes.add-child(DT-CLGBI(Di))
        for Di <- Dn
            DT.branch_no.add-child(DT-CLGBI(Di))
end
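The recursion above could look roughly like this in Python. To stay short, the sketch makes several assumptions that are not in the slides. Each "graph" is reduced to a set of label pairs. The pattern miner is passed in as a stand-in function. "Most discriminative" is taken to mean highest information gain. The stopping threshold is a pure or too-small node. All names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def build_tree(samples, mine_patterns, contains, min_size=2):
    """samples: list of (graph, class_label). Returns a nested dict."""
    labels = [label for _, label in samples]
    majority = Counter(labels).most_common(1)[0][0]
    # Stopping threshold (assumption): the node is pure or too small to split.
    if len(set(labels)) == 1 or len(samples) < min_size:
        return {"decision": majority}
    best = None
    for pattern in mine_patterns([g for g, _ in samples]):
        yes = [s for s in samples if contains(s[0], pattern)]
        no = [s for s in samples if not contains(s[0], pattern)]
        if not yes or not no:
            continue  # this pattern does not split the data
        remainder = sum(len(part) / len(samples) * entropy([c for _, c in part])
                        for part in (yes, no))
        gain = entropy(labels) - remainder
        if best is None or gain > best[0]:
            best = (gain, pattern, yes, no)
    if best is None:
        return {"decision": majority}
    _, pattern, yes, no = best
    return {
        "pattern": pattern,
        "yes": build_tree(yes, mine_patterns, contains, min_size),
        "no": build_tree(no, mine_patterns, contains, min_size),
    }


# Hypothetical data: each "graph" is just the set of label pairs it contains.
samples = [
    ({("dev", "ops"), ("dev", "dev")}, "engineering"),
    ({("dev", "ops")}, "engineering"),
    ({("pm", "sales")}, "business"),
    ({("pm", "sales"), ("dev", "pm")}, "business"),
]


def all_pairs(graphs):
    """Stand-in for the pattern miner: every pair seen in any graph."""
    return sorted(set().union(*graphs))


def has_pair(graph, pattern):
    return pattern in graph


tree = build_tree(samples, all_pairs, has_pair)
```

With real graphs, the stand-in miner would be replaced by the pair-expansion step and has_pair by a subgraph isomorphism test.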