Mining Social Data

FOSDEM 2013

Credits

Speaker: Romeu "@malk_zameth" MOURA
Company: @linagora
License: CC-BY-SA 3.0
SlideShare: j.mp/XXgBAn
Sources:
● Mining Graph Data
● Mining the Social Web
● Social Network Analysis for Startups
● Social Media Mining and Social Network Analysis
● Graph Mining

https://twitter.com/malk_zameth

https://twitter.com/linagora


http://creativecommons.org/licenses/by-sa/3.0/


http://j.mp/XXgBAn


http://eu.wiley.com/WileyCDA/WileyTitle/productCd-0471731900.html


http://shop.oreilly.com/product/0636920010203.do

http://www.igi-global.com/book/social-media-mining-social-network/69849

http://www.morganclaypool.com/doi/abs/10.2200/S00449ED1V01Y201209DMK006?journalCode=dmk

I work at Linagora, a French FLOSS company.

http://linagora.com


EloData & OpenGraphMiner

Linagora's foray into ESN, DataStorage, Graphs & Mining.

Why mine social data at all?

Without being a creepy stalker

To see what humans can't.

Influence, centers of interest.

To remember what humans can't.

What worked in the past? Objectively, how did I behave until now?

To discover what humans won't.

Serendipity: find what you were not looking for

Real life social data

What is so specific about it?

Always graphs

Dense substructures

Every vertex is a unique entity (someone).

Several dense subgraphs: relations within pockets of people

Usually it has no good cuts

Even the best partitioning algorithms cannot find partitions that are just not there

There will be errors & unknowns

Exact matching is not an option

Plenty of vanity metrics pollution.

Sometimes very surprising ones.

Number of followers is a vanity metric

@GuyKawasaki (~1.5M followers) is retweeted much more than the user with the most followers (@justinbieber, ~34M)

Why use graphs?

What is the itch with Inductive Logic that Inductive Graphs scratch?

'Classic' Data Mining

Pros and cons

pro: Solid, well-known techniques with good performance

con: Complex structures are translated into Bayesian networks or multi-relational tables, incurring either data loss or combinatorial explosion.

Graph Mining

'The new deal'

pro: Expressiveness and simplicity

The input and output are graphs, no conversions, graph algorithms all around.

con: The unit of operation is comparing isomorphisms

Subgraph isomorphism is NP-complete
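To make the cost concrete, here is a minimal brute-force sketch of subgraph matching in plain Python. It is not how a real graph miner does it, and the function name and the toy graphs are assumptions made up for illustration. It tries every injective mapping of pattern nodes onto graph nodes, which is exactly the combinatorial blow-up the slide warns about. Extra edges in the graph are allowed (non-induced matching).

```python
from itertools import permutations


def is_subgraph_isomorphic(pattern_edges, graph_edges):
    """Return True if the undirected pattern occurs somewhere in the graph."""
    pattern_nodes = sorted({v for edge in pattern_edges for v in edge})
    graph_nodes = sorted({v for edge in graph_edges for v in edge})
    graph_edge_set = {frozenset(edge) for edge in graph_edges}
    # Try every injective mapping of pattern nodes onto graph nodes.
    # The number of candidates grows factorially with the graph size.
    for candidate in permutations(graph_nodes, len(pattern_nodes)):
        mapping = dict(zip(pattern_nodes, candidate))
        if all(frozenset((mapping[a], mapping[b])) in graph_edge_set
               for a, b in pattern_edges):
            return True
    return False


# Hypothetical toy data: a triangle pattern and two small "social" graphs.
triangle = [("x", "y"), ("y", "z"), ("z", "x")]
social_graph = [("ann", "bob"), ("bob", "cat"), ("cat", "ann"), ("cat", "dan")]
path_only = [("ann", "bob"), ("bob", "cat"), ("cat", "dan")]

has_triangle = is_subgraph_isomorphic(triangle, social_graph)
path_has_triangle = is_subgraph_isomorphic(triangle, path_only)
```

With 4 graph nodes and a 3-node pattern there are only 24 mappings to test; with a few thousand users the same loop is hopeless, which is why the mining algorithms later in the talk rely on heuristics.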

Extraction

Getting the data

Is the easy part

A commodity, really.

Social networks provide APIs

Facebook Graph API, Twitter REST API, Yammer API, etc.

http://developers.facebook.com/docs/getting-started/graphapi/

https://dev.twitter.com/docs/api/1.1

http://developer.yammer.com/restapi/

Worst case: crawl the website

Crawling The Web For Fun And Profit:

http://youtu.be/eQtxbaw__W8





# Python 2 code written against the legacy Twitter Search API.
import sys
import json
import twitter
import networkx as nx
from recipe__get_rt_origins import get_rt_origins


def create_rt_graph(tweets):
    # Directed graph: each edge goes from the original author to the retweeter.
    g = nx.DiGraph()
    for tweet in tweets:
        rt_origins = get_rt_origins(tweet)
        if not rt_origins:
            continue
        for rt_origin in rt_origins:
            g.add_edge(rt_origin.encode('ascii', 'ignore'),
                       tweet['from_user'].encode('ascii', 'ignore'),
                       {'tweet_id': tweet['id']})
    return g


if __name__ == '__main__':
    # Join all command-line arguments into a single search query.
    Q = ' '.join(sys.argv[1:])
    MAX_PAGES = 15
    RESULTS_PER_PAGE = 100

    twitter_search = twitter.Twitter(domain='search.twitter.com')

    # Fetch the result pages one by one.
    search_results = []
    for page in range(1, MAX_PAGES + 1):
        search_results.append(
            twitter_search.search(q=Q, rpp=RESULTS_PER_PAGE, page=page))

    # Flatten the pages into one list of tweets.
    all_tweets = [tweet
                  for page in search_results
                  for tweet in page['results']]

    g = create_rt_graph(all_tweets)

    # Report basic statistics on stderr.
    print >> sys.stderr, "Number nodes:", g.number_of_nodes()
    print >> sys.stderr, "Num edges:", g.number_of_edges()
    print >> sys.stderr, "Num connected components:", \
        len(nx.connected_components(g.to_undirected()))
    print >> sys.stderr, "Node degrees:", sorted(nx.degree(g))

https://github.com/ptwobrussell/Recipes-for-Mining-Twitter
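The same idea can be tried offline, without any API access. The sketch below is an assumption-laden stand-in, not the code from the repository: `get_rt_origins_sketch`, the regular expression and the `text` field are hypothetical, and the sample tweets are made up. It uses only the standard library and stores the graph as a dict of dicts (original author -> retweeter -> tweet id).

```python
import re
from collections import defaultdict

# Assumed convention: "RT @user" or "via @user" credits the original author.
RT_PATTERN = re.compile(r"\b(?:RT|via)\b((?:\W*@\w+)+)", re.IGNORECASE)


def get_rt_origins_sketch(tweet):
    """Return the screen names credited in a tweet's text (hypothetical helper)."""
    origins = []
    for mentions in RT_PATTERN.findall(tweet["text"]):
        origins.extend(re.findall(r"@(\w+)", mentions))
    return origins


def build_rt_graph(tweets):
    """Map each original author to the users who retweeted them."""
    graph = defaultdict(dict)
    for tweet in tweets:
        for origin in get_rt_origins_sketch(tweet):
            # Edge direction: original author -> retweeter.
            graph[origin][tweet["from_user"]] = tweet["id"]
    return dict(graph)


# Made-up sample data shaped like the tweets used in the code above.
sample_tweets = [
    {"id": 1, "from_user": "bob", "text": "RT @alice: graphs are great"},
    {"id": 2, "from_user": "cat", "text": "RT @alice: graphs are great"},
    {"id": 3, "from_user": "dan", "text": "Nice talk (via @bob)"},
    {"id": 4, "from_user": "eve", "text": "No retweet here"},
]

rt_graph = build_rt_graph(sample_tweets)
# Number of distinct retweeters per author: a less vain metric than followers.
retweeter_counts = {author: len(fans) for author, fans in rt_graph.items()}
```

Counting distinct retweeters per author is one simple way to look at influence without falling back on the number of followers.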

Finding patterns

Substructures that repeat

Older options

Apriori-based, Pattern growth

Stepwise pair expansion

Separate the graph into pairs, count frequencies, keep the most frequent, augment them by one, repeat.

"Chunk": Separate the graph by pairs

Keep only the frequent ones

Expand them

Find your frequent pattern
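One round of this can be sketched in a few lines of plain Python. Everything here is an illustrative assumption: the node labels, the `MIN_COUNT` threshold, and the greedy way occurrences are merged. A pair type is simply the sorted labels of two connected nodes.

```python
from collections import Counter


def pair_type(labels, u, v):
    """A pair is identified by the sorted labels of its two nodes."""
    return tuple(sorted((labels[u], labels[v])))


def count_pairs(labels, edges):
    """Count how often each pair type appears among the edges."""
    return Counter(pair_type(labels, u, v) for u, v in edges)


def chunk_once(labels, edges, pair):
    """Greedily merge non-overlapping occurrences of `pair` into pseudo-nodes."""
    merged_into = {}                      # original node -> pseudo-node id
    new_labels = dict(labels)
    for u, v in edges:
        if u == v or u in merged_into or v in merged_into:
            continue                      # a node can belong to one chunk only
        if pair_type(labels, u, v) == pair:
            chunk_id = "chunk(%s,%s)" % (u, v)
            merged_into[u] = merged_into[v] = chunk_id
            new_labels[chunk_id] = "(%s-%s)" % pair
            del new_labels[u], new_labels[v]
    # Rewire the remaining edges onto the pseudo-nodes.
    new_edges = set()
    for u, v in edges:
        a, b = merged_into.get(u, u), merged_into.get(v, v)
        if a != b:
            new_edges.add(tuple(sorted((a, b))))
    return new_labels, sorted(new_edges)


# Made-up labelled graph: people tagged by role.
labels = {"a": "dev", "b": "ops", "c": "dev", "d": "ops", "e": "boss"}
edges = [("a", "b"), ("c", "d"), ("b", "e"), ("a", "c")]
MIN_COUNT = 2

counts = count_pairs(labels, edges)
frequent = [pair for pair, n in counts.most_common() if n >= MIN_COUNT]
new_labels, new_edges = chunk_once(labels, edges, frequent[0])
recount = count_pairs(new_labels, new_edges)
```

After chunking, the `("boss", "ops")` pair no longer shows up in `recount`: the `ops` node has been swallowed by a `(dev-ops)` chunk, so any pattern overlapping a chunk is lost from then on.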

con: Chunkiness

"ChunkingLess" Graph Based Induction

CL-GBI [Cook et al.]

Inputs needed

1. Minimal frequency at which we consider a configuration to be a pattern: threshold

2. Number of most frequent patterns we will retain: beam size

3. Arbitrary number of times we will iterate: levels

1. "Chunk": Separate the graph by pairs

2. Select beam-size most frequent ones

3. Turn selected pairs into pseudo-nodes

4. Expand & Rechunk

Keep going back to step 2

Until you have done it levels times.
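The loop above can be sketched as follows. This is a heavily simplified illustration, not the published algorithm: the function name, the toy graph and the string labels for patterns are assumptions, and the labels are not canonical forms, so the same structure can appear under two names. What it does keep is the chunkingless idea: pseudo-nodes are added next to the units they were built from, so overlapping patterns survive.

```python
from collections import defaultdict
from itertools import combinations


def cl_gbi_sketch(labels, edges, threshold, beam_size, levels):
    """Return {pattern_label: frequency} found by a simplified chunkingless loop."""
    edge_set = {frozenset(e) for e in edges}
    # A unit is (covered nodes, pattern label); plain nodes are the first units.
    units = {(frozenset([n]), label) for n, label in labels.items()}
    found = {}
    for _ in range(levels):
        # Steps 1 and 4: enumerate connected, non-overlapping pairs of units.
        embeddings = defaultdict(set)
        for (nodes_a, lab_a), (nodes_b, lab_b) in combinations(units, 2):
            if nodes_a & nodes_b:
                continue
            if not any(frozenset((u, v)) in edge_set
                       for u in nodes_a for v in nodes_b):
                continue
            pattern = "(%s %s)" % tuple(sorted((lab_a, lab_b)))
            if pattern not in found:
                embeddings[pattern].add(nodes_a | nodes_b)
        # Step 2: keep the beam_size most frequent patterns above the threshold.
        ranked = sorted(((len(occ), pattern)
                         for pattern, occ in embeddings.items()
                         if len(occ) >= threshold),
                        key=lambda item: (-item[0], item[1]))
        selected = ranked[:beam_size]
        if not selected:
            break
        # Step 3: add pseudo-nodes WITHOUT removing the units they came from.
        for count, pattern in selected:
            found[pattern] = count
            units |= {(nodes, pattern) for nodes in embeddings[pattern]}
    return found


# Made-up graph: two dev-ops-boss chains joined by a dev-dev edge.
labels = {"a": "dev", "b": "ops", "c": "dev", "d": "ops", "e": "boss", "f": "boss"}
edges = [("a", "b"), ("b", "e"), ("c", "d"), ("d", "f"), ("a", "c")]

patterns = cl_gbi_sketch(labels, edges, threshold=2, beam_size=2, levels=2)
```

At the first level both `(dev ops)` and `(boss ops)` are kept even though they share the `ops` node, which a chunking approach would not allow. At the second level the dev-ops-boss chain is found twice under two different labels, a side effect of the non-canonical labels used in this sketch.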

Decision Trees

A Tree of patterns

Finding a pattern on a branch yields a decision

DT-CLGBI

DT-CLGBI(graph: D)
begin
    create_node DT in D
    if threshold-attained
        return DT
    else
        P <- select_most_discriminative(CL-GBI(D))
        (Dy, Dn) <- branch_DT_on_predicate(P)
        for Di <- Dy
            DT.branch_yes.add-child(DT-CLGBI(Di))
        for Di <- Dn
            DT.branch_no.add-child(DT-CLGBI(Di))
end
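A runnable toy version of this recursion is sketched below, under strong assumptions. Each graph is reduced to the set of patterns it contains (standing in for the pattern-mining step), "most discriminative" is taken to mean highest information gain, and the stop condition is "all graphs share one class or no pattern helps". The pattern names and class labels are made up.

```python
import math
from collections import Counter


def entropy(examples):
    """Shannon entropy of the class labels in a list of (patterns, label)."""
    counts = Counter(label for _, label in examples)
    total = len(examples)
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def split(examples, pattern):
    yes = [ex for ex in examples if pattern in ex[0]]
    no = [ex for ex in examples if pattern not in ex[0]]
    return yes, no


def information_gain(examples, pattern):
    yes, no = split(examples, pattern)
    if not yes or not no:
        return 0.0
    remainder = (len(yes) * entropy(yes) + len(no) * entropy(no)) / len(examples)
    return entropy(examples) - remainder


def build_tree(examples):
    """Return a class label (leaf) or a dict testing for one pattern."""
    classes = Counter(label for _, label in examples)
    majority = classes.most_common(1)[0][0]
    if len(classes) == 1:
        return majority
    candidates = sorted(set().union(*(patterns for patterns, _ in examples)))
    best = max(candidates, key=lambda p: information_gain(examples, p),
               default=None)
    if best is None or information_gain(examples, best) <= 1e-12:
        return majority
    yes, no = split(examples, best)
    return {"pattern": best, "yes": build_tree(yes), "no": build_tree(no)}


def classify(tree, patterns):
    """Walk the tree: finding the pattern on a branch yields a decision."""
    while isinstance(tree, dict):
        tree = tree["yes"] if tree["pattern"] in patterns else tree["no"]
    return tree


# Made-up training graphs, each summarised by the patterns found in it.
examples = [
    ({"dev-ops", "ops-boss"}, "engineering"),
    ({"dev-ops", "dev-dev"}, "engineering"),
    ({"sales-boss", "ops-boss"}, "business"),
    ({"sales-boss", "sales-sales"}, "business"),
]

tree = build_tree(examples)
prediction = classify(tree, {"dev-ops", "dev-boss"})
```

Here a single test on the `dev-ops` pattern is enough to separate the two classes; `ops-boss` appears in both and carries no information.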
