Detecting Communities Via Simultaneous Clustering of Graphs and Folksonomies Akshay Java Anupam...

Post on 05-Jan-2016


Detecting Communities Via Simultaneous Clustering of Graphs and Folksonomies

Akshay Java, Anupam Joshi, Tim Finin

University of Maryland, Baltimore County

KDD 2008 Workshop on Web Mining and Web Usage Analysis

Outline

• Introduction
• Community Detection
  – Clustering Approach
  – Spectral Approach
  – Co-Clustering
• Simultaneous Clustering
• Evaluation
• Future Work
• Conclusions


Social Media

"Describes the online technologies and practices that people use to share opinions, insights, experiences, and perspectives and engage with each other." ~Wikipedia

Social Media Graphs

G = (V,E) describing the relationships between different entities (People, Documents, etc.)

G’ = <V,T,R> a tri-partite graph that expresses how entities ‘Tag’ some resource

[Figure: tri-partite graph with numbered Users, Tags, and URLs nodes]

What is a Community

A community in the real world is identified in a graph as a set of nodes that have more links within the set than outside it.

[Figure: example networks: Political Blogs, Twitter Network, Facebook Network]
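The definition of a community as a node set with more links inside than outside can be checked directly by counting edges. The following is a minimal sketch; the toy edge list and the helper name `link_counts` are illustrative assumptions, not part of the original slides.

```python
def link_counts(edges, members):
    """Count undirected edges inside a node set and edges crossing its boundary."""
    members = set(members)
    internal = external = 0
    for u, v in edges:
        inside = (u in members) + (v in members)
        if inside == 2:
            internal += 1
        elif inside == 1:
            external += 1
    return internal, external


# Hypothetical toy graph: two triangles joined by a single bridge edge (2, 3).
EDGES = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]

internal, external = link_counts(EDGES, {0, 1, 2})
print(internal, external)  # 3 1
```

Here the set {0, 1, 2} has three internal links and one external link, so it satisfies the definition.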


Community Detection: Clustering Approach

1. Agglomerative/Hierarchical

Topological Overlap: Similarity is measured in terms of the number of nodes that both i and j link to. (Ravasz et al.)


2. Divisive/Partition based

Remove edges that have the highest edge betweenness centrality (Girvan-Newman Algorithm).

[Figure: Political Books network]
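The divisive idea can be sketched in a few lines: compute edge betweenness, remove the highest-scoring edge, and repeat until the graph falls apart. The code below is an illustrative sketch for small unweighted, undirected graphs, using a Brandes-style breadth-first accumulation; the function names and the toy graph are assumptions, and only a single split is performed rather than the full hierarchy.

```python
from collections import defaultdict, deque


def edge_betweenness(adj):
    """Brandes-style edge betweenness for an unweighted, undirected graph."""
    score = defaultdict(float)
    for s in adj:
        # BFS from s, counting shortest paths (sigma) and recording predecessors.
        dist = {s: 0}
        sigma = defaultdict(float)
        sigma[s] = 1.0
        preds = defaultdict(list)
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, accumulating path dependencies.
        delta = defaultdict(float)
        for w in reversed(order):
            for v in preds[w]:
                share = sigma[v] / sigma[w] * (1.0 + delta[w])
                score[frozenset((v, w))] += share
                delta[v] += share
    # Each unordered pair of endpoints was counted once from each side.
    return {edge: value / 2.0 for edge, value in score.items()}


def components(adj):
    """Connected components as a list of node sets."""
    seen, parts = set(), []
    for start in adj:
        if start in seen:
            continue
        part, stack = set(), [start]
        while stack:
            v = stack.pop()
            if v not in part:
                part.add(v)
                stack.extend(adj[v] - part)
        seen |= part
        parts.append(part)
    return parts


def girvan_newman_split(edges):
    """Remove highest-betweenness edges until the graph gains a component."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    start = len(components(adj))
    while len(components(adj)) == start:
        scores = edge_betweenness(adj)
        if not scores:  # no edges left to remove
            break
        u, v = max(scores, key=scores.get)
        adj[u].discard(v)
        adj[v].discard(u)
    return components(adj)


# Hypothetical toy graph: two triangles joined by the bridge edge (2, 3).
EDGES = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
parts = girvan_newman_split(EDGES)
print(sorted(sorted(p) for p in parts))  # [[0, 1, 2], [3, 4, 5]]
```

The bridge carries every shortest path between the two triangles, so it has the highest betweenness and is removed first.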

Community Detection: Spectral Approach

• The graph can be partitioned using the eigenspectrum of the Laplacian. (Shi and Malik)

• The eigenvector associated with the second smallest eigenvalue of the graph Laplacian is the Fiedler vector.

• The graph can be recursively partitioned using the sign of the values in its Fiedler vector.

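Graph Laplacian

$L = D - W$, with normalized form $D^{-1/2} L D^{-1/2} = I - D^{-1/2} W D^{-1/2}$

Normalized Cuts

$\mathrm{NCut}(A,B) = \mathrm{Cut}(A,B) \left[ \frac{1}{\mathrm{Vol}(A)} + \frac{1}{\mathrm{Vol}(B)} \right]$

• Cut(A,B): cost of edges deleted to disconnect the graph
• Vol(B): total cost of all edges that start from B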
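As an illustration of the spectral approach, the sketch below builds the normalized Laplacian, takes the eigenvector of the second smallest eigenvalue, and splits nodes by sign; it then evaluates the NCut value of that split. It assumes NumPy is available and that every node has at least one edge; the helper names and the toy graph are hypothetical.

```python
import numpy as np


def adjacency(n, edges):
    """Symmetric 0/1 weight matrix for an undirected edge list."""
    W = np.zeros((n, n))
    for u, v in edges:
        W[u, v] = W[v, u] = 1.0
    return W


def fiedler_bipartition(W):
    """Split a graph in two using the sign of the Fiedler vector.

    Assumes W is symmetric, non-negative, and has no isolated nodes.
    """
    W = np.asarray(W, dtype=float)
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    # Normalized Laplacian: I - D^{-1/2} W D^{-1/2}
    L_norm = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    _, vectors = np.linalg.eigh(L_norm)  # eigenvalues come back in ascending order
    # Rescale to the generalized problem (D - W) y = lambda * D y.
    fiedler = d_inv_sqrt * vectors[:, 1]
    return fiedler >= 0


def ncut(W, in_a):
    """NCut(A, B) = Cut(A, B) * (1 / Vol(A) + 1 / Vol(B))."""
    W = np.asarray(W, dtype=float)
    in_a = np.asarray(in_a, dtype=bool)
    cut = W[in_a][:, ~in_a].sum()
    vol_a = W[in_a].sum()
    vol_b = W[~in_a].sum()
    return cut * (1.0 / vol_a + 1.0 / vol_b)


# Hypothetical toy graph: two triangles joined by the bridge edge (2, 3).
W = adjacency(6, [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)])
labels = fiedler_bipartition(W)
ncut_value = ncut(W, labels)
print(labels, ncut_value)
```

Which side is labelled `True` depends on the arbitrary sign of the eigenvector, so only the grouping is meaningful. Recursive partitioning would apply the same step to each side in turn.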

Community Detection: Co-Clustering

• Spectral graph bipartitioning
• Compute the graph Laplacian from A, where $A \in \mathbb{R}^{n \times m}$ is the document by term matrix (Dhillon et al.)
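One common way to carry out spectral co-clustering treats documents and terms as the two sides of a bipartite graph and reads the partition off the second singular vectors of the degree-normalized matrix. The sketch below follows that formulation as an assumption, since the slide's own formula did not survive extraction; the function name and the toy matrix are hypothetical, and NumPy is assumed to be available.

```python
import numpy as np


def cocluster_bipartition(A):
    """Bipartition rows and columns of A together (spectral co-clustering sketch).

    Assumes every row and every column of A has a positive sum.
    """
    A = np.asarray(A, dtype=float)
    d1 = A.sum(axis=1)  # row (document) degrees
    d2 = A.sum(axis=0)  # column (term) degrees
    # Degree-normalized matrix D1^{-1/2} A D2^{-1/2}
    A_n = A / np.sqrt(d1)[:, None] / np.sqrt(d2)[None, :]
    U, _, Vt = np.linalg.svd(A_n, full_matrices=False)
    # The second left/right singular vectors play the role of the Fiedler vector.
    row_vec = U[:, 1] / np.sqrt(d1)
    col_vec = Vt[1, :] / np.sqrt(d2)
    return row_vec >= 0, col_vec >= 0


# Hypothetical document-by-term matrix: documents 0-1 mostly use terms 0-1,
# documents 2-3 use terms 2-3, and document 1 also mentions term 2.
A = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
])
doc_side, term_side = cocluster_bipartition(A)
print(doc_side, term_side)
```

Documents and terms that land on the same side form one co-cluster; the sign itself is arbitrary.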


Social Media Graphs

[Figure: two panels, "Links Between Nodes" and "Links Between Nodes and Tags", with Simultaneous Cuts across both]

Communities in Social Media

A community in the real world is identified in a graph as a set of nodes that have more links within the set than outside it and share similar tags.

Clustering Tags and Graphs

[Figure: example block adjacency matrix with Nodes×Nodes, Nodes×Tags, Tags×Nodes, and Tags×Tags blocks]

Fiedler Vector Polarity: 1, 1, −1, −1, −1, 1, 1, −1, −1

$W' = \begin{pmatrix} I & C \\ C^T & \beta W \end{pmatrix}$

• β = 0 is like co-clustering
• β = 1 gives equal importance to blog-blog and blog-tag links
• β >> 1 approaches NCut
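The combined matrix W' and the sign-based split can be sketched as follows. This is an illustrative reading of the slide, not the authors' implementation: it assumes C has one row per tag and one column per node so that the blocks conform in the order written (tags first, then nodes), it reuses a normalized-Laplacian bipartition, and the toy data and function name are hypothetical. NumPy is assumed to be available.

```python
import numpy as np


def simcut_bipartition(W, C, beta=1.0):
    """Bipartition tags and nodes from W' = [[I, C], [C^T, beta * W]].

    Assumption: C is tag-by-node, and every node has a positive total
    weight in W' (at least one tag, or at least one link with beta > 0).
    """
    W = np.asarray(W, dtype=float)
    C = np.asarray(C, dtype=float)
    n_tags = C.shape[0]
    W_prime = np.block([[np.eye(n_tags), C], [C.T, beta * W]])
    d_inv_sqrt = 1.0 / np.sqrt(W_prime.sum(axis=1))
    size = len(W_prime)
    L_norm = np.eye(size) - d_inv_sqrt[:, None] * W_prime * d_inv_sqrt[None, :]
    _, vectors = np.linalg.eigh(L_norm)  # ascending eigenvalues
    polarity = d_inv_sqrt * vectors[:, 1] >= 0  # Fiedler vector polarity
    return polarity[:n_tags], polarity[n_tags:]  # (tag sides, node sides)


# Hypothetical toy data: six nodes forming two triangles joined by edge (2, 3).
W = np.zeros((6, 6))
for u, v in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]:
    W[u, v] = W[v, u] = 1.0

# Four hypothetical tags; row t marks the nodes that carry tag t.
C = np.array([
    [1, 1, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 0],
    [0, 0, 0, 0, 1, 1],
])

tag_side, node_side = simcut_bipartition(W, C, beta=1.0)
print(tag_side, node_side)
```

Each detected community then comes with the tags that fall on the same side of the cut. In this sketch, a larger `beta` gives the node-node links more weight relative to the node-tag links.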

[Figure: clustering results, "Clustering Only Links" vs. "Clustering Links + Tags"]


Datasets

• Citeseer
  – Agents, AI, DB, HCI, IR, ML
  – Words used in place of tags
• Blog data
  – Derived from the WWE/Buzzmetrics dataset
  – Tags associated with blogs derived from del.icio.us
  – For dimensionality reduction, 100 topics derived from blog homepages using LDA (Latent Dirichlet Allocation)
• Pairwise similarity computed
  – RBF kernel for Citeseer
  – Cosine for blogs
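The two pairwise similarity measures named above can be written compactly. This is a minimal sketch on plain Python lists; the kernel width `gamma` is an assumed parameter, since the slides do not state the value used, and the feature vectors are made up for illustration.

```python
import math


def rbf_similarity(x, y, gamma=1.0):
    """RBF (Gaussian) kernel: exp(-gamma * ||x - y||^2). gamma is an assumed width."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def cosine_similarity(x, y):
    """Cosine of the angle between x and y; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0
    return dot / (norm_x * norm_y)


# Hypothetical feature vectors (for example word counts or topic weights).
doc_a = [1.0, 2.0, 0.0]
doc_b = [2.0, 4.0, 0.0]
print(rbf_similarity(doc_a, doc_b, gamma=0.1))
print(cosine_similarity(doc_a, doc_b))
```

Both functions return values in [0, 1] for non-negative inputs, which makes the resulting matrix usable as a weight matrix W.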

Citeseer Data

[Figure: two panels, Accuracy = 36% and Accuracy = 62%]

Higher accuracy by adding ‘tag’ information

[Figure: Citeseer Data, panels NCut, SimCut, True]

SimCut constrains cuts based on both
• Link Structure
• Tags

SimCut results in
• Higher intra-cluster similarity
• Lower inter-cluster similarity

Blog Data

[Figure: Blog Data, panels NCut and SimCut, 35 Clusters]

• NCut: few, large clusters with low intra-cluster similarity
• SimCut: moderate size clusters with higher intra-cluster similarity

Effect of Number of Tags, Clusters: Citeseer

More tags help, to an extent

Lower mutual information if only the graph is used

Mutual Information compares clusters to ground truth

Effect of Number of Tags, Clusters: Blogs

More tags help, to an extent

Lower mutual information if only the graph is used

Mutual Information compares clusters to content-based clusters (no tags/graph)
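Mutual information between a detected clustering and a reference labeling can be computed from label counts alone. The sketch below uses the plain (unnormalized) definition in nats; whether the reported numbers were normalized is not stated in the slides, so treat this as one possible reading. The label lists are made up for illustration.

```python
import math
from collections import Counter


def mutual_information(labels_a, labels_b):
    """Mutual information (in nats) between two labelings of the same items."""
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    mi = 0.0
    for (a, b), n_ab in joint.items():
        # p(a, b) * log( p(a, b) / (p(a) * p(b)) )
        mi += (n_ab / n) * math.log(n_ab * n / (count_a[a] * count_b[b]))
    return mi


# Hypothetical labelings of four items.
reference = [0, 0, 1, 1]
same_grouping = [1, 1, 0, 0]   # identical grouping, different label names
unrelated = [0, 1, 0, 1]       # independent of the reference

print(mutual_information(reference, same_grouping))
print(mutual_information(reference, unrelated))
```

The score depends only on how items are grouped, not on the label names, which is why it suits comparing cluster assignments against ground truth or against content-based clusters.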


Future Work

• Evaluating the SimCut algorithm on derived feature types such as named entities, sentiments and opinions, and links to mainstream media.

• Comparing graph-based, text-based, and graph+tag-based clustering on a dataset with ground truth.

• Evaluating the effect of varying β.


Conclusions

• Many Social Media sites allow users to tag resources

• Incorporating folksonomies in community detection can yield better results

• SimCut can be easily implemented and relates to NCut with two simultaneous objectives
  – Minimize number of node-node edges being cut
  – Minimize number of node-tag edges being cut

• Detected communities can be associated with meaningful, descriptive tags

Thanks!

http://ebiquity.umbc.edu
http://socialmedia.typepad.com

[Figures: "More Tags"; "Only Graph" vs. "SimCut"; Citeseer (Community Size, Similarity); Blogs (Community Size, Similarity)]