Subject : Discovering Overlapping Groups in Social Media Professor : Dr. sh.Esmaili The Student’s...

Discovering Overlapping Groups in Social Media

Subject: Discovering Overlapping Groups in Social Media

Professor: Dr. sh.Esmaili

Presenters: Mr. Hossien Sadrizadeh (slides 3 to 55), Mr. Houshyar Mohammadi Talvar (slides 57 to 78)

Date: June 21st, 2012 (Thursday, Tir 1st, 1391)

1/79


Mr. Hossien Sadrizadeh: slides 3 to 53

2/79


Introduction

• The following social media sites are attracting more users than ever:
  • Facebook
  • Twitter
  • Wikipedia
  • Blogger
  • Myspace

• In 2009, the global time spent on social media sites increased by 82% over the previous year.

• Facebook, one of the most popular social media sites, has more than 500 million active users, and the number is still increasing.

3/79


Introduction (continued)

• What kinds of activities do people engage in on social media?

• On social media websites, users can participate in social activities, for example:
  • Connecting with other like-minded people.
  • Updating their status.
  • Posting blogs.
  • Uploading photos.
  • Bookmarking and tagging.

• People can join groups on different websites, for instance:
  • Fans of sports teams can join dedicated groups.
  • They can share their opinions on team performance.
  • They can comment on the latest news about players.

4/79


Group - Community

• A group (community) can be considered a set of users in which each user interacts more frequently with users inside the group than with users outside it.

• Some social media websites (Flickr, YouTube) provide explicit groups that users can join.

• Some dynamic sites (Twitter, Delicious) have no clear group structure, so their communities have to be discovered with community detection.

5/79


Group - Community

• In social media, a community is:
  • A group of people who are more similar to people within the group than to people outside it.

• Homophily is one of the important reasons people connect with others. For example:
  • People from the same city talk more frequently.
  • People with similar political viewpoints are more likely to vote for the same candidates.
  • People watch the same movies because they like the same movie stars.

6/79


Why groups?

• Group-level investigation can provide useful information.

• Studying individual behaviour is usually difficult for a large population.

• Studying statistics at the website level often fails to capture sufficient detail.

7/79


An example of forming groups¹

• We have a set of 50 people and want to form two sets with the following properties:
  • The first set contains the people whose names begin with “J”.
  • The second set contains the people whose names begin with “W”.

1. An example made by the presenter (Hossien Sadrizadeh).

http://en.wikipedia.org/wiki/Partition_of_a_set

8/79


Another example of forming groups

• We have a set of 50 people and want to form two sets with the following properties:
  • The first set contains the people whose names begin with “J”.
  • The second set contains the people whose names end with “W”.
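The difference between the two examples can be sketched in Python. The names below are hypothetical and serve only as an illustration: when both rules test the first letter, the two groups can never share a member, whereas a first-letter rule combined with a last-letter rule lets one person fall into both groups.

```python
# Hypothetical names, chosen only for illustration.
people = ["John", "Jack", "William", "Walter", "Jarrow", "Matthew", "Sara"]

# First example: both rules test the first letter, so no name satisfies both.
starts_j = {p for p in people if p.startswith("J")}
starts_w = {p for p in people if p.startswith("W")}

# Second example: one rule tests the first letter, the other the last letter.
ends_w = {p for p in people if p.lower().endswith("w")}

disjoint_overlap = starts_j & starts_w     # always empty
overlapping_members = starts_j & ends_w    # may be non-empty
```

With these names, `overlapping_members` contains "Jarrow", the only name that both begins with "J" and ends with "w".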

Adaptation And Enhancement Of Evaluation Measure To Overlapping Graph Clustering(Tatiana Gossen, Michael Kotzyba,)

9/79


Overlapping - Introduction

• The multiple interactions in social activities imply that community structures often overlap.
  • Example: one person belongs to several communities.

• The idea is to take advantage of the network information between users and tags in social media and to discover these overlapping communities with co-clustering.

• Co-clustering is a way to obtain this kind of community structure.

10/79


Overlapping

• When a website has explicit groups and allows users to join more than one group based on their personal preferences, overlapping takes place.

• When there are no explicit groups available, a community detection algorithm can be used to obtain such groups.

11/79


Community detection

• Community detection is usually based on structural features (links).

• A sketch of a small network displaying community structure, with three groups of nodes with dense internal connections and sparser connections between groups.

12/79

http://en.wikipedia.org/wiki/File:Network_Community_Structure.svg


Co-Clustering

• The graph on the right has two types of nodes:
  • Vertices u1-u5 on the left for users.
  • Vertices t1-t4 on the right for tags.

• Edges represent the tag subscription relation between users and tags.

• If we use a method that forms two clusters, we see that u3 is associated with both clusters.

13/79


Co-Clustering

• There are two methods of clustering:
  • Vertex clustering.
  • Edge clustering.

• Clustering edges is preferable to clustering vertices.

• Clustering edges usually yields overlapping communities.

14/79


Our contribution¹

• We propose to discover overlapping communities in social networks.

• We use user-tag subscription information instead of user-user links.

• We obtain clusters containing users and tags simultaneously.

1. The research team

15/79


Co-Clustering

In this graph, the edges connected to nodes t1, t2 and the edges connected to nodes t3, t4 are clustered into two separate groups, both containing user u3.

16/79


Community – Mathematical Definition

• Suppose:

• A community Ci (1 ≤ i ≤ k) is a subset of users and tags, where k is the number of communities.

• Communities usually overlap: Ci ∩ Cj ≠ ∅.

• We use an adjacency matrix to represent the relation between users and their subscribed tags (a sparse matrix).

17/79


Adjacency Matrix via Incidence Matrix

18/79


User-Tag Network

• In a user-tag network, each edge is associated with a user vertex ui and a tag vertex tp.

• We can use an incidence matrix. Each edge vector in this matrix has Nu + Nt entries (Nu for users and Nt for tags).

• For example, the edge between u1 and t1 in the following graph is:
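As an illustration of the incidence representation, the sketch below builds the vector for the edge between u1 and t1. It assumes a graph with 5 users and 4 tags (the sizes of the earlier co-clustering graph), 0-based indices, and user positions placed before tag positions; the function name `edge_vector` is hypothetical.

```python
def edge_vector(user_index, tag_index, n_users, n_tags):
    """Return the incidence vector of edge e(u, t) with Nu + Nt entries."""
    vec = [0] * (n_users + n_tags)
    vec[user_index] = 1            # assumed layout: user positions come first
    vec[n_users + tag_index] = 1   # tag positions follow the user positions
    return vec

# Assumed sizes: users u1-u5 and tags t1-t4; indices are 0-based.
N_USERS, N_TAGS = 5, 4
e_u1_t1 = edge_vector(0, 0, N_USERS, N_TAGS)
```

Every edge vector built this way contains exactly two non-zero entries, which is why the incidence matrix is sparse.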

19/79


User-Tag Network

The incidence matrix

20/79


Why is the incidence matrix useful?

The incidence matrix

• It is a sparse matrix.

• We can implement it with a linked list (or a doubly linked list).

21/79


Overlapping co-clustering problem

• The overlapping co-clustering problem can be stated formally as follows:

Input:
• A user-tag subscription matrix of size Nu × Nt, where Nu and Nt are the numbers of users and tags, respectively.

• The number of communities k.

Output:
• k overlapping communities, each consisting of both users and tags.

22/79


The Co-Clustering Framework

• A user usually has several friendships, but a link is usually related to only one community, so we cluster edges instead of nodes.

• After obtaining edge clusters, communities can be recovered by replacing each edge with its two vertices, i.e., a node is in a community if any of its connections is in the community.

• The communities obtained this way are often highly overlapping.
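The recovery step can be sketched as follows. The edge-to-cluster assignment is hypothetical toy data (not the output of any particular algorithm), and `recover_communities` is an illustrative name: each edge belongs to exactly one cluster, yet a node whose edges fall into different clusters ends up in several communities.

```python
from collections import defaultdict

# Hypothetical edge clustering: every (user, tag) edge has exactly one cluster id.
edge_cluster = {
    ("u1", "t1"): 0, ("u2", "t1"): 0, ("u2", "t2"): 0, ("u3", "t2"): 0,
    ("u3", "t3"): 1, ("u4", "t3"): 1, ("u4", "t4"): 1, ("u5", "t4"): 1,
}

def recover_communities(edge_cluster):
    """Replace each edge by its two vertices to obtain node communities."""
    communities = defaultdict(set)
    for (user, tag), cluster_id in edge_cluster.items():
        communities[cluster_id].update((user, tag))
    return dict(communities)

communities = recover_communities(edge_cluster)
shared = communities[0] & communities[1]   # nodes that belong to both communities
```

Here user u3 has one edge in each cluster, so it is the only member of both communities.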

23/79


Make Categories - Find Clusters

Communities that aggregate similar users and tags together can be detected by maximizing intra-cluster similarity, as shown in the following equation (this formulation can be solved by the k-means algorithm).
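A common way to write such an objective is given below as an assumption, not as the exact formula of the presentation. It uses k communities C_1, ..., C_k with centroids c_1, ..., c_k and a similarity function S between an edge x_j and a centroid:

```latex
\max_{C_1,\dots,C_k} \; \frac{1}{k} \sum_{i=1}^{k} \sum_{x_j \in C_i} S(x_j, c_i)
```

k-means addresses objectives of this form by alternating between assigning each edge to its most similar centroid and recomputing the centroids.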

24/79


Disadvantage of k-means clustering

• K-means is not efficient for large-scale data sets.

• So what should we do?

• Our proposal¹ is to use a variant of k-means called EdgeCluster, an efficient and scalable algorithm for extracting communities from sparse networks.

• Why is EdgeCluster efficient? Because:
  • Each centroid is compared only to a small set of edges that are correlated with the centroid.
  • It is reported to be able to cluster a sparse network with more than one million nodes into thousands of clusters in tens of minutes.

1. The writers

25/79


Density

The expected density of the user-tag network is shown in the following equation:

26/79


Key step in clustering edges

• Define the edge similarity.

• Given two edges e(ui, tp) and e’(uj, tq) in a user-tag graph, the similarity between them can be defined by the following equation:

27/79


Similarity Schemes for clustering

• There are three similarity schemes:
  • Independent Learning.
  • Normalized Learning.
  • Correlational Learning.

• Our framework¹ can cover different similarity schemes.

1. The writers

28/79


Independent Learning

• A common approach is to use the Kronecker delta function as the similarity.

• The tag similarity can be represented by the following function (the user similarity can be defined in the same way):
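A minimal sketch of this scheme is shown below. The Kronecker delta returns 1 for identical items and 0 otherwise. Combining the user and tag similarities with a weight `alpha` is an assumption made for illustration, and the function names are hypothetical.

```python
def kronecker_delta(a, b):
    """Return 1 if the two items are identical, otherwise 0."""
    return 1 if a == b else 0

def edge_similarity(edge_a, edge_b, alpha=0.5):
    """Assumed combination: alpha * user similarity + (1 - alpha) * tag similarity."""
    (ui, tp), (uj, tq) = edge_a, edge_b
    user_sim = kronecker_delta(ui, uj)
    tag_sim = kronecker_delta(tp, tq)
    return alpha * user_sim + (1 - alpha) * tag_sim

same_edge = edge_similarity(("u1", "t1"), ("u1", "t1"))
shared_tag = edge_similarity(("u1", "t1"), ("u2", "t1"))
unrelated = edge_similarity(("u1", "t1"), ("u2", "t2"))
```

With `alpha = 0.5`, two edges that share only a tag (or only a user) get similarity 0.5, identical edges get 1, and edges with no common endpoint get 0.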

29/79


Independent Learning – Cosine Similarity (continued)

• Cosine similarity is widely used to measure the similarity between two vectors. It is defined in the following form.

• Given two edges e(ui, tp) and e’(uj, tq), the cosine similarity can be defined by the following equation:
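Assuming each edge is represented by an incidence vector with a 1 at its user position and a 1 at its tag position (so every edge vector has norm √2), the cosine similarity reduces to counting shared endpoints. The derivation below holds under that assumption, with δ denoting the Kronecker delta:

```latex
S_e(e, e') = \frac{e \cdot e'}{\lVert e \rVert \, \lVert e' \rVert}
           = \frac{\delta(u_i, u_j) + \delta(t_p, t_q)}{\sqrt{2}\,\sqrt{2}}
           = \frac{1}{2}\bigl(\delta(u_i, u_j) + \delta(t_p, t_q)\bigr)
```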

30/79


Independent Learning – Cosine Similarity (continued)

• The cosine of two vectors can be easily derived by using the Euclidean dot product formula:

Given two vectors of attributes, A and B, the cosine similarity, cos(θ), is represented using a dot product and the vector magnitudes: cos(θ) = (A · B) / (‖A‖ ‖B‖)

31/79


An example of cosine similarity

• Given the following two vectors, what is their similarity?

  A = (1, 2, 3) and B = (2, 5, -3)

• What do you think about the range of the similarity?

• The resulting similarity lies in [-1, 1]:
  • −1 means exactly opposite.
  • 1 means exactly the same.
  • 0 usually indicates independence.
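The similarity of the vectors (1, 2, 3) and (2, 5, -3) can be worked out with a few lines of Python. The function name `cosine_similarity` is illustrative; the computation is the dot product divided by the product of the two Euclidean norms.

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their Euclidean norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

A = (1, 2, 3)
B = (2, 5, -3)
similarity = cosine_similarity(A, B)   # dot = 3, norms = sqrt(14) and sqrt(38)
```

The result is 3 / √532 ≈ 0.13, a small positive value: the two vectors are close to orthogonal.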

32/79


Text matching¹

• The attribute vectors A and B are usually the term frequency vectors of the documents. The cosine similarity can be seen as a method of normalizing document length during comparison.

• In the case of information retrieval, the cosine similarity of two documents will range from 0 to 1, since the term frequencies (tf-idf weights) cannot be negative. The angle between two term frequency vectors cannot be greater than 90°.

1. Co-clustering Documents and Words Using Bipartite Spectral Graph Partitioning (S. Dhillon)

33/79


Normalized Learning

• Let d_ui denote the degree of user ui, and d_tp the degree of tag tp, in a user-tag network.

• After normalization, edge e(ui, tp) can be represented in the following form:

1. The research team

34/79


Normalized Learning (continued)

• Given two edges e(ui, tp) and e’(uj, tq), the cosine similarity between them after normalization can be written as the following equation:

35/79


Normalized Learning (continued)

• If we set the parameter to 0.5, we can derive the following equation, which gives the normalized edge similarity.

• This formula says that the similarity between two users depends not only on the users but also on the tags.

36/79


Correlational Learning

• Users often use more than one tag to describe the main topic of a bookmark.

• Tags that are used together indicate their correlation.

• In a user-tag network:
  • On one side, a user can be viewed as a vector by treating tags as features.
  • On the other side, a tag can also be viewed as a vector by treating users as features.

• We use a latent space to represent the users and to capture the correlation between their tags.

37/79


Correlational Learning (continued)

• Let’s take the following basis vectors along the orthogonal axes of the latent space:

• User vectors in the original space can be mapped to new vectors in the latent space, as shown here:

M is a linear mapping from the original space to the latent space.

38/79



• We map the vectors from the original space to the latent space like this (using a mapping function):

39/79



• Another way to select a set of orthogonal basis vectors is Singular Value Decomposition (SVD).

• The singular value decomposition of a user-tag network M is given by the following formula:

40/79



• The user latent vectors can be formulated as follows:

• Only a small set of vectors is needed to compute them:

41/79



• User similarity and tag similarity are defined by the following formula in the latent space :

• Z is obtained by solving for the generalized eigenvectors.

42/79



• The adjacency matrix and the Laplacian matrix are:

43/79



• The generalized eigenvector problem can be rewritten as:

• After simple manipulation, we obtain:

44/79


Singular Value Decomposition – SVD¹,²

• SVD is based on a theorem from linear algebra which says that a rectangular matrix A can be broken down into the product of three matrices:
  • An orthogonal matrix U.
  • A diagonal matrix S.
  • The transpose of an orthogonal matrix V.

• The Gram-Schmidt orthogonalization process:
  • Is a method for converting a set of vectors into a set of orthonormal vectors.
  • It uses normalization.

1. Linear Algebra, Kenneth Hoffman (Chapter 8: vector spaces).
2. Numerical Analysis, Samuel D. Conte (Chapter 4: matrices, eigenvalues, and eigenvectors).

45/79


Another view of Gram-Schmidt
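One concrete way to look at the Gram-Schmidt process is as a short procedure: for each input vector, subtract its projections onto the orthonormal vectors found so far, then normalize what remains. The sketch below is the classical variant in plain Python; the function name, the tolerance `tol`, and the input vectors are illustrative choices.

```python
import math

def gram_schmidt(vectors, tol=1e-12):
    """Classical Gram-Schmidt: return an orthonormal list spanning the same space."""
    basis = []
    for v in vectors:
        w = list(v)
        for q in basis:
            proj = sum(x * y for x, y in zip(v, q))       # component of v along q
            w = [wi - proj * qi for wi, qi in zip(w, q)]  # remove that component
        norm = math.sqrt(sum(x * x for x in w))
        if norm > tol:   # skip vectors that depend linearly on earlier ones
            basis.append([x / norm for x in w])
    return basis

Q = gram_schmidt([[1.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 1.0]])
```

Each vector in `Q` has unit length and is orthogonal to the others, which is the property the SVD example relies on when it turns a matrix of eigenvectors into an orthogonal matrix.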

46/79


Singular Value Decomposition – SVD

• The theorem is usually presented with a formula like this :
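With A an m × n matrix, U and V orthogonal, and S diagonal, the decomposition is commonly written as follows (the dimension subscripts are an assumption about the intended presentation):

```latex
A_{m \times n} = U_{m \times m} \, S_{m \times n} \, V^{T}_{n \times n},
\qquad U^{T} U = I_{m}, \qquad V^{T} V = I_{n}
```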

47/79


Example - SVD

• Start with the matrix:

• To find U, we first compute AA^T.

48/79


Example – SVD (continued)

• Next, we find the eigenvalues and corresponding eigenvectors of AA^T.

• We then store the eigenvectors as the columns of a matrix, ordered by the size of the corresponding eigenvalue.

49/79


Example – SVD (continued)

• Finally, we convert this matrix into an orthogonal matrix.

50/79



• We use a similar method to find V, based on A^T A.

• Find the eigenvalues of A^T A.

51/79



• For all of the data, we have the following vectors:

• Ordered by the size of the eigenvalues, we have:

52/79


Example – SVD (continued)

• After the orthonormalization process, we convert it to an orthogonal matrix.

53/79


Example – SVD (continued)

• For S, we take the square roots of the non-zero eigenvalues and place them on the diagonal, putting the largest in S11, the next largest in S22, and so on, down to the smallest value in Smm.

• The non-zero eigenvalues of AA^T and A^T A are the same.
• The diagonal entries of S are the singular values of A.
• The columns of U are called left singular vectors.
• The columns of V are called right singular vectors.
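These steps can be checked on a small case. The sketch below uses a hypothetical 2×3 matrix (an assumption, not necessarily the matrix used in the presentation) and plain Python: it forms AA^T, obtains its eigenvalues from the 2×2 characteristic polynomial, takes square roots to get the singular values, and derives each right singular vector from a left singular vector.

```python
import math

# Hypothetical 2x3 matrix chosen for illustration.
A = [[3.0, 1.0, 1.0],
     [-1.0, 3.0, 1.0]]

def transpose(M):
    return [list(col) for col in zip(*M)]

def matmul(X, Y):
    Yt = transpose(Y)
    return [[sum(a * b for a, b in zip(row, col)) for col in Yt] for row in X]

def matvec(M, v):
    return [sum(a * b for a, b in zip(row, v)) for row in M]

def unit(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

At = transpose(A)
AAt = matmul(A, At)   # 2x2 symmetric matrix
AtA = matmul(At, A)   # 3x3 symmetric matrix

# Eigenvalues of the symmetric 2x2 matrix [[a, b], [b, d]], largest first.
a, b, d = AAt[0][0], AAt[0][1], AAt[1][1]
mean = (a + d) / 2
radius = math.sqrt(((a - d) / 2) ** 2 + b ** 2)
eigenvalues = [mean + radius, mean - radius]

# Singular values are the square roots of the non-zero eigenvalues.
singular_values = [math.sqrt(lam) for lam in eigenvalues]

# For b != 0, (b, lam - a) is an eigenvector of AAt; normalizing gives the columns of U.
U_cols = [unit([b, lam - a]) for lam in eigenvalues]

# Each right singular vector is A^T u, normalized; it is an eigenvector of AtA.
V_cols = [unit(matvec(At, u)) for u in U_cols]
```

For this matrix, AA^T has eigenvalues 12 and 10, so the singular values are √12 and √10, and each vector in `V_cols` is an eigenvector of A^T A with the same eigenvalue as its partner in `U_cols`.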

54/79


Example – SVD (continued)

• Now we have the following matrices:

55/79


Mr. Houshyar Mohammadi Talvar: slides 57 to 78

56/79


SYNTHETIC DATA AND FINDINGS

Clustering evaluation is difficult when there is no ground truth. We first introduce the synthetic data and how they are generated, then the clustering quality measurement Normalized Mutual Information (NMI). Finally, the NMI of different clustering methods is reported.

Synthetic Data Generation

We develop a synthetic data generator that allows input of the numbers of clusters, users and tags. First, users and tags are split evenly among the clusters. Then, in each cluster, users and tags are randomly connected with a specified density (e.g., 0.8).
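A generator of this kind can be sketched as follows. This is an illustration under stated assumptions, not the generator used in the experiments: the function name, the round-robin split, and the `seed` parameter are hypothetical, and only the two steps described here (even split, then random intra-cluster connections at a given density) are covered.

```python
import random

def generate_user_tag_graph(n_clusters, n_users, n_tags, density, seed=0):
    """Return a set of (user, tag) edges drawn within evenly sized clusters."""
    rng = random.Random(seed)
    edges = set()
    for c in range(n_clusters):
        # Even split (assumed round-robin): item i belongs to cluster i % n_clusters.
        users = [u for u in range(n_users) if u % n_clusters == c]
        tags = [t for t in range(n_tags) if t % n_clusters == c]
        for u in users:
            for t in tags:
                if rng.random() < density:   # keep each possible edge with this probability
                    edges.add((f"u{u + 1}", f"t{t + 1}"))
    return edges

full = generate_user_tag_graph(n_clusters=3, n_users=6, n_tags=6, density=1.0)
sparse = generate_user_tag_graph(n_clusters=3, n_users=6, n_tags=6, density=0.5)
```

With density 1.0, every user in a cluster is linked to every tag in that cluster, so 3 clusters of 2 users and 2 tags give 12 edges; lower densities keep a random subset of those edges.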

57/79


Figure 2 shows a toy example of the synthetic user-tag graph in which users are labeled u1−u7 and tags t1−t8. Three overlapping clusters are highlighted with different colors.

58/79


NMI Evaluation in Synthetic Data

The Normalized Mutual Information (NMI) is commonly used to measure clustering quality. Given two clusterings X and Y, the NMI is defined below.

The NMI is computed in two steps:

First, find the pairs of clusters that are closest to each other in the two clusterings.

Second, average the mutual information between those pairs of clusters.

The higher the NMI value, the more similar the two clusterings are. If two clusterings X and Y are exactly the same, the NMI value is 1.
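To make the behaviour of the measure concrete, the sketch below swaps in the standard NMI for non-overlapping labelings, I(X; Y) / sqrt(H(X) · H(Y)). It does not implement the two-step matching procedure for overlapping clusters described above, and the function names are hypothetical. It shows the stated property that identical clusterings score 1, even when the cluster ids are permuted.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a hard labeling."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def nmi(x, y):
    """Standard NMI for two hard (non-overlapping) labelings of the same items."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    mutual = sum(
        (c / n) * math.log((c / n) / ((px[a] / n) * (py[b] / n)))
        for (a, b), c in pxy.items()
    )
    hx, hy = entropy(x), entropy(y)
    if hx == 0 or hy == 0:   # degenerate case: a labeling with a single cluster
        return 1.0 if hx == hy else 0.0
    return mutual / math.sqrt(hx * hy)

same_up_to_renaming = nmi([0, 0, 1, 1], [1, 1, 0, 0])
independent = nmi([0, 0, 1, 1], [0, 1, 0, 1])
```

The first pair describes the same grouping with swapped cluster ids and scores 1; the second pair shares no information and scores 0.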

59/79


NMI and Number of Clusters

We generate another data set with 1,000 users and 1,000 tags and a varying number of clusters, ranging from 5 to 50. The cluster density is set to 1, so that all users connect to all tags within each cluster.

Figure 3. NMI Performance w.r.t Number of Clusters

60/79


NMI and Link Density

We also study how intra-cluster link density affects clustering in synthetic data sets. We created synthetic data sets (50 clusters, 1,000 users and 1,000 tags) with different intra-cluster densities that range from 0.1 to 1.

Figure 4. NMI Performance w.r.t Intra-cluster Link Density

61/79


Figure 3. NMI Performance w.r.t Number of Clusters

Figure 4. NMI Performance w.r.t Intra-cluster Link Density

See Correlational Learning in Figures 3 and 4.

62/79


SOCIAL MEDIA DATA AND FINDINGS

BlogCatalog is a social blog directory where bloggers can register their blogs under predefined categories. We crawled user names, user ids, their friends, blogs, the associated tags and blog categories.

Delicious is a social bookmarking website, which allows users to tag, manage, and share online resources (e.g., articles). For each resource, users are asked to provide several tags to summarize its main topic.

63/79


Interplay between Link Connection and Tag Sharing

There exist explicit and implicit relations between users. Examples of explicit relations are the friends or fans people choose to be. Examples of implicit relations are tag sharing, i.e., people who use the same tags.

Is there any correlation between the two different relations? What drives people to connect to others? Is it a random operation? We conducted a statistical analysis of user-user links and tag sharing.

In the first study, we fix users who have or have no connection with others, then show the tag sharing probabilities.
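One way to compute such tag sharing probabilities is sketched below. The users, tags, and links are hypothetical toy data, and `shared_tag_distribution` is an illustrative name. For connected or for unconnected user pairs, it returns the fraction of pairs that share exactly k tags.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy data: tags used by each user and explicit user-user links.
user_tags = {
    "a": {"health", "food", "diet"},
    "b": {"health", "food"},
    "c": {"games", "music"},
    "d": {"music", "food"},
}
links = {frozenset(("a", "b")), frozenset(("c", "d"))}

def shared_tag_distribution(user_tags, links, connected):
    """Fraction of (un)connected user pairs that share exactly k tags."""
    counts = Counter()
    for u, v in combinations(sorted(user_tags), 2):
        if (frozenset((u, v)) in links) == connected:
            counts[len(user_tags[u] & user_tags[v])] += 1
    total = sum(counts.values())
    return {k: c / total for k, c in counts.items()} if total else {}

p_connected = shared_tag_distribution(user_tags, links, connected=True)
p_unconnected = shared_tag_distribution(user_tags, links, connected=False)
```

In this toy data, half of the connected pairs share two tags, while no unconnected pair shares more than one.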

64/79


Interplay between Link Connection and Tag Sharing (continued)

Figure 5 shows the tag sharing probabilities in BlogCatalog and Delicious data sets. For Delicious data, the friends network and fans network are evaluated separately.

Figure 5. X-axis represents the number of tags that two users share.

65/79


Figures 6 and 7 show the probability that two users are connected if they share tags in BlogCatalog and Delicious, respectively. In Figure 6, the probability of a link between two users increases with the number of tags they share.

Figure 6. Link probability w.r.t tag sharing in BlogCatalog


66/79


Figure 7. Link probability w.r.t tag sharing in Delicious


67/79


Clustering Evaluation

The clustering evaluation consists of three studies:

1. First, cross-validation is performed to demonstrate the effectiveness of different clustering algorithms on the BlogCatalog data set.

2. Second, we study the correlation between user connectivity and co-occurrence in extracted communities.

3. Third, concrete examples illustrate what the clusters are about.

68/79


1) Comparative Study: In BlogCatalog, categories for each blog are selected by the blog owner from a predefined list. With category information, certain procedures such as cross-validation (e.g., treating categories as class labels and cluster memberships as features) can be used to show the clustering quality.

Linear SVM is adopted in our experiments since it scales well to large data sets. As recommended by Tang et al., 1,000 communities are used in our experiments. We vary the fraction of training data from 10% to 90% and use the rest as test data.

This experiment is repeated 10 times and the average Micro-F1 and Macro-F1 measures are reported.

69/79


Table II shows five different clustering methods and their prediction performance. In this table, the fourth algorithm, EdgeCluster, uses the user-user network rather than the user-tag network. Dhillon’s co-clustering algorithm is based on Singular Value Decomposition (SVD) of the normalized user-tag matrix.

As shown in Table II, Correlational Learning consistently performs better, especially when the training set is small. According to Table II, normalization does not improve performance. This suggests normalization should be applied cautiously. Dhillon’s co-clustering method, which can only deal with non-overlapping clustering, does not perform well compared to the other methods.

70/79


2) Connectivity Study: We study the correlation between user co-occurrence in extracted communities and the actual social connections between them. We also study the connectivity between users who are in the top similar list. 1,000 overlapping communities are extracted by Correlational Learning.

71/79


We study the dis-connectivity between users who are most similar. Figure 8 shows that the probability of being disconnected is higher than 96% and 99% in BlogCatalog and Delicious, respectively, which means that the majority of homogeneous users are not connected in actual social networks.

For example, users marama6 and ameer1577 both are interested in the online game “World of Warcraft”.

Figure 8. Probability of Being Disconnected between Top Similar Users

72/79


3) Illustrative Examples: Health is the second largest category (the largest is personal) in BlogCatalog, a hot topic that attracts a lot of attention.

73/79


The largest cluster about Health obtained by Correlational Learning is cluster-health, with 127 users and 102 tags. The cluster that has the maximum user overlap with cluster-health is cluster-nutrition, with 83 users and 25 tags. Their tag clouds are shown in Figures 10 and 11. The two clusters have 18 users and 3 tags (health, nutrition, and weight loss) in common. Both clusters are related to health, but the first has an emphasis on physical health, highlighted by the tags arthritis, drugs, food, and dentist, and the second is more about nutrition.

74/79


The top 102 tags of category-health are compared to the tags of cluster-health, and the top 25 tags of category-health to those of cluster-nutrition. The numbers of shared tags are 16 for cluster-health and 9 for cluster-nutrition.

75/79


In addition, we aggregate the tags of the users in cluster-health and present the 102 most frequent tags in Figure 12. Comparing these tags with those of cluster-health, 40 tags are in common. Many tags such as environment, humor, and jokes are not present in the tag cloud of cluster-health, which suggests that these users actually have other interests besides health. A similar pattern is observed for cluster-nutrition.

76/79

Discovering Overlapping Groups in Social Media

77/79


CONCLUSIONS AND FUTURE WORK

This study suggests more interesting problems that are worth further exploring. Formulating the co-clustering problem into an objective function and maximizing it is one direction to work on.

We proposed a framework to study the overlapping clustering of users and tags in online social media, which helps to understand the major concerns within the groups. Experimental results on synthetic data reveal that Correlational Learning is very effective in recovering the overlapping cluster structures even when the inner cluster density is low.

78/79


?

79/79
