Discovering Overlapping Groups in social Media
Subject: Discovering Overlapping Groups in Social Media
Professor: Dr. Sh. Esmaili
Students: Mr. Hossien Sadrizadeh (slides 3 to 55), Mr. Houshyar Mohammadi Talvar (slides 57 to 78)
Date: Thursday, June 21st, 2012 (Tir 1st, 1391)
1/79
Introduction
• The following social media sites are attracting more users than ever:
  • Facebook
  • Twitter
  • Wikipedia
  • Blogger
  • Myspace
• In 2009, global time spent on social media sites increased by 82% over the previous year.
• Facebook, one of the most popular social media sites, has more than 500 million active users, and the number is still increasing.
3/79
Introduction (continued)
• What kinds of activities do people engage in on social media?
• On social media websites, users can participate in social activities, for example:
  • Connecting with other like-minded people.
  • Updating their status.
  • Posting blogs.
  • Uploading photos.
  • Bookmarking and tagging.
• People can join groups on different websites, for instance:
  • Fans of sports teams can join dedicated groups.
  • They can share their opinions on team performance.
  • They can comment on the latest news about players.
4/79
Group - Community
• A group (community) can be considered a set of users in which each user interacts more frequently with users within the group than with users outside it.
• Some social media websites (Flickr, YouTube) provide explicit groups that users can join.
• Some dynamic sites (Twitter, Delicious) have no clear group structure, so community detection is needed to discover the groups.
5/79
Group - Community
• In social media, a community is:
  • A group of people who are more similar to people within the group than to people outside it.
• Homophily is one of the important reasons people connect with others. For example:
  • People from the same city talk more frequently.
  • People with similar political viewpoints are more likely to vote for the same candidates.
  • People watch the same movies because they like the same movie stars.
6/79
Why groups?
• Group-level investigation can provide useful information.
• Studying individual behaviour is usually difficult for a large population.
• Studying statistics at the website level often fails to capture sufficient detail.
7/79
An example of making groups¹
• We have a set of 50 people. We want to make two sets with the following properties:
  • The first set contains the people whose names begin with "J".
  • The second set contains the people whose names begin with "W".
• These two sets cannot overlap, because a name has only one first letter.
1. An example made by the presenter (Hossien Sadrizadeh).
http://en.wikipedia.org/wiki/Partition_of_a_set
8/79
Another example of making groups
• We have a set of 50 people. We want to make two sets with the following properties:
  • The first set contains the people whose names begin with "J".
  • The second set contains the people whose names end with "W".
• These two sets can overlap, because a name can begin with "J" and also end with "W".
Adaptation And Enhancement Of Evaluation Measure To Overlapping Graph Clustering (Tatiana Gossen, Michael Kotzyba,)
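The difference between the two examples can be sketched in a few lines of Python. The names below are hypothetical (the slides only say there are 50 people), and the helper functions are illustrative, not part of the presentation.

```python
# Hypothetical names; the slides do not list the 50 people.
people = ["John", "Jane", "William", "Wendy", "Andrew", "Matthew", "Jakow"]

def starts_with(names, letter):
    """Return the set of names whose first letter is `letter`."""
    return {n for n in names if n.upper().startswith(letter)}

def ends_with(names, letter):
    """Return the set of names whose last letter is `letter`."""
    return {n for n in names if n.upper().endswith(letter)}

j_first = starts_with(people, "J")
w_first = starts_with(people, "W")
w_last = ends_with(people, "W")

# First example: "begins with J" and "begins with W" never overlap.
disjoint = j_first & w_first
# Second example: "begins with J" and "ends with W" may overlap.
overlap = j_first & w_last
```

With these names, the first pair of sets has an empty intersection, while the second pair shares the one name that both begins with "J" and ends with "w".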
9/79
Overlapping - Introduction
• The multiple interactions in social activities imply that community structures often overlap.
  • Example: one person belongs to several communities.
• The idea is to take advantage of the network information between users and tags in social media and to discover these overlapping communities with co-clustering.
• Co-clustering is a way to obtain this kind of community structure.
10/79
Overlapping
• When a website has explicit groups and allows users to join more than one group based on their personal preferences, overlapping takes place.
• When no explicit groups are available, a community detection algorithm can be used to obtain such groups.
11/79
Community detection
• Community detection is usually based on structural features (links).
• Figure: a sketch of a small network displaying community structure, with three groups of nodes with dense internal connections and sparser connections between groups.
12/79
Co-Clustering
• The graph on the right has two types of nodes:
  • Vertices u1-u5 on the left for users.
  • Vertices t1-t4 on the right for tags.
• Edges represent the tag subscription relation between users and tags.
• If we use a method that produces two clusters, we will see that u3 is associated with both clusters.
13/79
Co-Clustering
• There are two methods of clustering:
  • Vertex clustering.
  • Edge clustering.
• Clustering edges is better than clustering vertices.
• Clustering edges usually yields overlapping communities.
14/79
Our contribution¹
• We propose to discover overlapping communities in social networks.
• We use user-tag subscription information instead of user-user links.
• We obtain clusters containing users and tags simultaneously.
1. The research team.
15/79
Co-Clustering
In this graph, the edges connecting to nodes t1, t2 and the edges connecting to nodes t3, t4 are clustered into two separate groups, both containing user u3.
16/79
Community - Mathematical Definition
• Suppose:
  • A community Ci (1 ≤ i ≤ k) is a subset of users and tags, where k is the number of communities.
  • Communities usually overlap: Ci ∩ Cj ≠ ∅.
  • We use an adjacency matrix to represent the relation between users and their subscribed tags (a sparse matrix).
17/79
User-Tag Network
• In a user-tag network, each edge is associated with a user vertex ui and a tag vertex tp.
• We can use an incidence matrix: each edge is a vector of length Nu + Nt (Nu for users and Nt for tags).
• For example, the vector for the edge between u1 and t1 has a 1 in the position of u1 and a 1 in the position of t1, and 0 everywhere else.
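A minimal sketch of such an edge vector, assuming users occupy the first Nu positions and tags the last Nt positions (this ordering and the function name are assumptions made for illustration). The sizes 5 and 4 match the u1-u5 / t1-t4 graph mentioned earlier.

```python
def edge_vector(user_index, tag_index, num_users, num_tags):
    """Incidence-style vector of length Nu + Nt for one user-tag edge.

    Assumed layout: positions 0..Nu-1 are users, Nu..Nu+Nt-1 are tags.
    Indices are zero-based, so u1 is user_index 0 and t1 is tag_index 0.
    """
    vec = [0] * (num_users + num_tags)
    vec[user_index] = 1
    vec[num_users + tag_index] = 1
    return vec

# Edge between u1 and t1 in a graph with 5 users and 4 tags.
e_u1_t1 = edge_vector(0, 0, num_users=5, num_tags=4)
```

Because every edge vector has exactly two non-zero entries, storing only the pair (user index, tag index) is enough, which is why a sparse representation fits well here.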
19/79
Why is the incidence matrix useful?
The incidence matrix:
• It is a sparse matrix.
• We can implement it with a linked list (or a doubly linked list).
21/79
Overlapping co-clustering problem
• The overlapping co-clustering problem can be stated formally as follows:
Input:
• A user-tag subscription matrix of size Nu × Nt, where Nu and Nt are the numbers of users and tags, respectively.
• k, the number of communities.
Output:
• k overlapping communities, each consisting of both users and tags.
22/79
The Co-Clustering Framework
• A user usually has several friendships, but a single link is usually related to only one community, so we cluster edges instead of nodes.
• After obtaining edge clusters, communities can be recovered by replacing each edge with its two vertices, i.e., a node is in a community if any of its connections is in the community.
• The obtained communities are then often highly overlapping.
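The recovery step can be sketched as follows. The edge list and cluster assignment are hypothetical; they only mimic the earlier description in which edges touching t1, t2 form one cluster and edges touching t3, t4 form another.

```python
from collections import defaultdict

# Hypothetical edge clustering: (user, tag) -> cluster id.
edge_clusters = {
    ("u1", "t1"): 0, ("u2", "t1"): 0, ("u2", "t2"): 0, ("u3", "t2"): 0,
    ("u3", "t3"): 1, ("u4", "t3"): 1, ("u4", "t4"): 1, ("u5", "t4"): 1,
}

def communities_from_edges(edge_clusters):
    """Replace each edge by its two endpoints to get node communities."""
    communities = defaultdict(set)
    for (user, tag), cluster in edge_clusters.items():
        communities[cluster].update((user, tag))
    return dict(communities)

communities = communities_from_edges(edge_clusters)
# Nodes that belong to both communities.
shared = communities[0] & communities[1]
```

Each edge belongs to exactly one cluster, but a node with edges in both clusters (here u3) ends up in both communities, which is how the overlap arises.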
23/79
Make Categories - Find Clusters
Communities that aggregate similar users and tags together can be detected by maximizing intra-cluster similarity, as shown in the following equation (this formulation can be solved by the k-means algorithm):
24/79
Disadvantage of k-means clustering
• K-means is not efficient for large-scale data sets.
• Then, what should we do?
• Our proposal¹ is to use a variant of k-means called EdgeCluster, an efficient and scalable algorithm for extracting communities from sparse networks.
• Why is EdgeCluster efficient? Because:
  • Each centroid is compared only to a small set of edges that are correlated with the centroid.
  • It is reported to be able to cluster a sparse network with more than one million nodes into thousands of clusters in tens of minutes.
1. The authors.
25/79
Density
The expected density of the user-tag network is shown in the following equation:
26/79
Key Step in Clustering Edges
• Define edge similarity.
• Given two edges e(ui, tp) and e'(uj, tq) in a user-tag graph, the similarity between them can be defined with the following equation:
27/79
Similarity Schemes for Clustering
• There are 3 similarity schemes:
  • Independent Learning.
  • Normalized Learning.
  • Correlational Learning.
• Our framework¹ can cover different similarity schemes.
1. The authors.
28/79
Independent Learning
• A common choice is to use the Kronecker delta function as the similarity: δ(p, q) = 1 if p = q, and 0 otherwise.
• The tag similarity can be represented by the following function (the user similarity can be defined in the same way):
29/79
Independent Learning - Cosine Similarity (continued)
• Cosine similarity is widely used to measure the similarity between two vectors. It is defined in the following form.
• Given two edges e(ui, tp) and e'(uj, tq), the cosine similarity can be defined with the following equation:
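As an illustration, if each edge is written as a 0/1 vector with a 1 for its user and a 1 for its tag, the cosine similarity of two edges reduces to the average of two Kronecker deltas, (δ(ui, uj) + δ(tp, tq)) / 2. The sketch below checks this numerically; the vector layout, the function names, and the sample edges are assumptions made for the example.

```python
import math

def edge_vector(user, tag, num_users, num_tags):
    """0/1 vector of length Nu + Nt (users first, then tags; zero-based)."""
    vec = [0] * (num_users + num_tags)
    vec[user] = 1
    vec[num_users + tag] = 1
    return vec

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def delta(i, j):
    """Kronecker delta."""
    return 1 if i == j else 0

def delta_similarity(e1, e2):
    """Average of the user delta and the tag delta for two edges."""
    (ui, tp), (uj, tq) = e1, e2
    return 0.5 * (delta(ui, uj) + delta(tp, tq))

NUM_USERS, NUM_TAGS = 5, 4
edges = [(0, 0), (0, 1), (2, 1), (4, 3)]  # hypothetical (user, tag) pairs

# Compare the two ways of computing the similarity for every pair of edges.
max_gap = max(
    abs(cosine(edge_vector(u1, t1, NUM_USERS, NUM_TAGS),
               edge_vector(u2, t2, NUM_USERS, NUM_TAGS))
        - delta_similarity((u1, t1), (u2, t2)))
    for (u1, t1) in edges
    for (u2, t2) in edges
)
```

Two edges therefore have similarity 1 when they are identical, 0.5 when they share only a user or only a tag, and 0 when they share nothing.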
30/79
Independent Learning - Cosine Similarity (continued)
• The cosine of two vectors can be easily derived from the Euclidean dot product formula: A · B = ‖A‖ ‖B‖ cos(θ).
• Given two vectors of attributes, A and B, the cosine similarity, cos(θ), is represented using a dot product and magnitudes: cos(θ) = (A · B) / (‖A‖ ‖B‖).
31/79
An example of cosine similarity
• For the following two vectors, the similarity is:
A = (1, 2, 3) and B = (2, 5, −3)
cos(θ) = (2 + 10 − 9) / (√14 · √38) = 3 / √532 ≈ 0.13
• What do you think about the range of the similarity?
• The resulting similarity lies in [−1, 1]:
  • −1 means exactly opposite.
  • 1 means exactly the same.
  • 0 usually indicates independence.
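The example with the vectors (1, 2, 3) and (2, 5, −3) can be checked with a short Python sketch. The function name is illustrative; the extra vectors only demonstrate the two ends of the range.

```python
import math

def cosine_similarity(a, b):
    """Dot product divided by the product of the Euclidean norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

A = (1, 2, 3)
B = (2, 5, -3)

sim = cosine_similarity(A, B)          # 3 / sqrt(14 * 38), about 0.13
same = cosine_similarity(A, A)         # identical direction
opposite = cosine_similarity(A, (-1, -2, -3))  # exactly opposite direction
```

The value for A and B is small and positive, so the two vectors are close to orthogonal; the other two calls give the extremes of the range.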
32/79
Text matching¹
• The attribute vectors A and B are usually the term frequency vectors of the documents. The cosine similarity can be seen as a method of normalizing document length during comparison.
• In the case of information retrieval, the cosine similarity of two documents will range from 0 to 1, since the term frequencies (tf-idf weights) cannot be negative. The angle between two term frequency vectors cannot be greater than 90°.
1. Co-clustering Documents and Words Using Bipartite Spectral Graph Partitioning (S. Dhillon)
33/79
Normalized Learning
• Let d_ui denote the degree of user ui, and let d_tp denote the degree of tag tp in a user-tag network.
• After normalization, edge e(ui, tp) can be represented in the following form:
1. The research team.
34/79
Normalized Learning (continued)
• Given two edges e(ui, tp) and e'(uj, tq), the cosine similarity between them after normalization can be written as the following equation:
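The normalized form itself is not reproduced in this transcript, so the sketch below rests on a stated assumption: each non-zero entry of an edge vector is divided by the degree of its node (1/d_ui for the user, 1/d_tp for the tag). Under that assumption the cosine similarity has the closed form checked in the code. The edge list and all names are hypothetical.

```python
import math
from collections import Counter

# Hypothetical user-tag edges as (user index, tag index).
edges = [(0, 0), (0, 1), (1, 1), (2, 1)]
NUM_USERS = 3

deg_user = Counter(u for u, _ in edges)  # d_u for every user
deg_tag = Counter(t for _, t in edges)   # d_t for every tag

def normalized_edge(u, t):
    """Sparse vector {position: value}; ASSUMES entries 1/d_u and 1/d_t."""
    return {u: 1.0 / deg_user[u], NUM_USERS + t: 1.0 / deg_tag[t]}

def cosine_sparse(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(v * b[k] for k, v in a.items() if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

def closed_form(e1, e2):
    """Same similarity written directly in terms of degrees and deltas."""
    (ui, tp), (uj, tq) = e1, e2
    same_user = 1 if ui == uj else 0
    same_tag = 1 if tp == tq else 0
    num = deg_tag[tp] * deg_tag[tq] * same_user + deg_user[ui] * deg_user[uj] * same_tag
    den = (math.sqrt(deg_user[ui] ** 2 + deg_tag[tp] ** 2)
           * math.sqrt(deg_user[uj] ** 2 + deg_tag[tq] ** 2))
    return num / den

e1, e2 = (0, 0), (0, 1)  # two edges that share user 0
numeric = cosine_sparse(normalized_edge(*e1), normalized_edge(*e2))
analytic = closed_form(e1, e2)
```

Unlike the 0/1 case, two edges sharing a user no longer score a fixed 0.5: the score depends on the degrees of the users and tags involved.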
35/79
Normalized Learning (continued)
• If we set the parameter to 0.5, then we can derive the following equation, which gives the normalized edge similarity.
• These formulas say that the similarity is related not only to the users but also to the tags.
36/79
Correlational Learning
• Users often use more than one tag to describe the main topic of a bookmark.
• Tags that are grouped together indicate their correlation.
• In a user-tag network:
  • On one side, a user can be viewed as a vector by treating tags as features.
  • On the other side, a tag can be viewed as a vector by treating users as features.
• We use a latent space to represent the users and capture the correlation between their tags.
37/79
Correlational Learning (continued)
• Take the following basis vectors along the orthogonal axes of the latent space:
• User vectors in the original space can be mapped to new vectors in the latent space as follows:
M is a linear mapping from the original space to the latent space.
38/79
Correlational Learning (continued)
• We map the vectors from the original space to the latent space using a mapping function, like this:
39/79
Correlational Learning (continued)
• Another method for selecting a set of orthogonal basis vectors is Singular Value Decomposition (SVD).
• The singular value decomposition of a user-tag network M is given by the following formula:
40/79
Correlational Learning (continued)
• User latent vectors can be formulated in the following form:
• We need only a small set of vectors to compute them, shown here:
41/79
Correlational Learning (continued)
• User similarity and tag similarity in the latent space are defined by the following formulas:
• Z is obtained by solving for the generalized eigenvectors.
42/79
Correlational Learning (continued)
• The adjacency matrix and the Laplacian matrix are:
43/79
Correlational Learning (continued)
• The generalized eigenvector problem can be rewritten as:
• After simple manipulation, we obtain:
44/79
Singular Value Decomposition - SVD¹,²
• SVD is based on a theorem from linear algebra which says that a rectangular matrix A can be broken down into the product of three matrices:
  • An orthogonal matrix U.
  • A diagonal matrix S.
  • The transpose of an orthogonal matrix V.
• Gram-Schmidt orthogonalization process:
  • It is a method for converting a set of vectors into a set of orthonormal vectors.
  • It uses normalization.
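A minimal sketch of the Gram-Schmidt process in plain Python. The input vectors are an arbitrary illustration, and the tolerance used to skip linearly dependent vectors is an assumption of this sketch.

```python
import math

def dot(a, b):
    """Euclidean dot product."""
    return sum(x * y for x, y in zip(a, b))

def gram_schmidt(vectors, tol=1e-12):
    """Turn a list of vectors into an orthonormal list spanning the same space.

    Each vector has its components along the already accepted basis
    vectors subtracted, and is then normalized to unit length.
    Vectors that become (numerically) zero are skipped.
    """
    basis = []
    for v in vectors:
        w = list(v)
        for q in basis:
            coeff = dot(w, q)
            w = [wi - coeff * qi for wi, qi in zip(w, q)]
        norm = math.sqrt(dot(w, w))
        if norm > tol:
            basis.append([wi / norm for wi in w])
    return basis

basis = gram_schmidt([[1.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 1.0]])
```

The resulting vectors have unit length and are pairwise orthogonal, which is the property required of the columns of U and V.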
1. Linear Algebra, Kenneth Hoffman (Chapter 8: Vector Spaces).
2. Numerical Analysis, Samuel D. Conte (Chapter 4: Matrices, Eigenvalues, and Eigenvectors).
45/79
Singular Value Decomposition - SVD
• The theorem is usually presented with a formula like this: A = U S V^T.
47/79
Example - SVD
• Start with the matrix:
• To find U, we first have to compute AA^T.
48/79
Example - SVD (continued)
• Next, we have to find the eigenvalues and corresponding eigenvectors of AA^T.
• We find the eigenvectors and store them as the columns of a matrix, ordered by the size of the corresponding eigenvalues.
49/79
Example - SVD (continued)
• Finally, we have to convert this matrix into an orthogonal matrix.
50/79
Example - SVD (continued)
• We use a similar method to find V, based on A^T A.
• Find the eigenvalues of A^T A.
51/79
Example - SVD (continued)
• For all of the data, we have the following vectors:
• Ordering by the size of the eigenvalues, we have:
52/79
Example - SVD (continued)
• After the orthonormalization process, we convert it to an orthogonal matrix.
53/79
Example - SVD (continued)
• For S, we take the square roots of the non-zero eigenvalues and place them on the diagonal, putting the largest in S11, the next largest in S22, and so on, with the smallest value in Smm.
• The non-zero eigenvalues of AA^T and A^T A are the same.
• The diagonal entries of S are the singular values of A.
• The columns of U are called left singular vectors.
• The columns of V are called right singular vectors.
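The procedure can be sketched in plain Python for a small case. The matrix A below is a hypothetical 2×3 example (the matrix used on the slides is not reproduced in this transcript), and the closed-form eigen-solver only handles symmetric 2×2 matrices with a non-zero off-diagonal entry. The right singular vectors are obtained from v_i = A^T u_i / σ_i, so only the two vectors belonging to non-zero singular values are produced.

```python
import math

def transpose(X):
    return [list(row) for row in zip(*X)]

def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def eig_sym_2x2(S):
    """Eigenvalues (largest first) and unit eigenvectors of a symmetric 2x2 matrix.

    Assumes the off-diagonal entry is non-zero.
    """
    a, b, d = S[0][0], S[0][1], S[1][1]
    if b == 0:
        raise ValueError("this sketch assumes a non-zero off-diagonal entry")
    mean = (a + d) / 2
    radius = math.hypot((a - d) / 2, b)
    result = []
    for lam in (mean + radius, mean - radius):
        v = (b, lam - a)  # solves (S - lam*I) v = 0
        n = math.hypot(v[0], v[1])
        result.append((lam, [v[0] / n, v[1] / n]))
    return result

A = [[3.0, 1.0, 1.0], [-1.0, 3.0, 1.0]]  # hypothetical example matrix
At = transpose(A)

AAt = matmul(A, At)                       # 2x2 symmetric matrix
eigs = eig_sym_2x2(AAt)
sigmas = [math.sqrt(lam) for lam, _ in eigs]      # singular values
U_cols = [vec for _, vec in eigs]                 # left singular vectors
V_cols = [[sum(At[r][k] * u[k] for k in range(2)) / s for r in range(3)]
          for u, s in zip(U_cols, sigmas)]        # right singular vectors

# Rebuild A as the sum of sigma_i * u_i * v_i^T to check the decomposition.
recon = [[sum(s * u[i] * v[j] for s, u, v in zip(sigmas, U_cols, V_cols))
          for j in range(3)] for i in range(2)]
```

For this matrix, AA^T has eigenvalues 12 and 10, so the singular values are √12 and √10, and the rebuilt matrix matches A up to rounding. A library routine would normally be used for anything larger.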
54/79
Example - SVD (continued)
• Now we have the following matrices:
55/79
Mr. Houshyar Mohammadi Talvar: slides 57 to 78
56/79
SYNTHETIC DATA AND FINDINGS
Clustering evaluation is difficult when there is no ground truth. We first introduce the synthetic data and how they are generated, then the clustering quality measure, Normalized Mutual Information (NMI). Finally, the NMI of different clustering methods is reported.
Synthetic Data Generation
We develop a synthetic data generator that takes the numbers of clusters, users, and tags as input. First, users and tags are split evenly among the clusters. Then, within each cluster, users and tags are randomly connected with a specified density (e.g., 0.8).
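A minimal sketch of a generator along these lines. The function name, the round-robin way of splitting nodes evenly, and the seed parameter are assumptions made for illustration; the sketch also leaves out how overlap between clusters is introduced, which is not described here.

```python
import random

def generate_user_tag_graph(num_clusters, num_users, num_tags, density, seed=0):
    """Generate user-tag edges with a planted (non-overlapping) cluster structure.

    Users and tags are assigned to clusters round-robin, which splits them
    evenly. Inside each cluster, every user-tag pair is connected with
    probability `density`.
    """
    rng = random.Random(seed)
    user_cluster = [u % num_clusters for u in range(num_users)]
    tag_cluster = [t % num_clusters for t in range(num_tags)]
    edges = set()
    for u in range(num_users):
        for t in range(num_tags):
            if user_cluster[u] == tag_cluster[t] and rng.random() < density:
                edges.add((u, t))
    return user_cluster, tag_cluster, edges

# Fully connected clusters (density 1) versus a sparser setting (density 0.8).
users_c, tags_c, dense_edges = generate_user_tag_graph(3, 6, 9, density=1.0)
_, _, sparse_edges = generate_user_tag_graph(3, 6, 9, density=0.8)
```

With density 1, every user is linked to every tag in its own cluster; lower densities drop some of those links at random, which makes the planted clusters harder to recover.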
57/79
Figure 2 shows a toy example of the synthetic user-tag graph, in which users are labeled u1-u7 and tags t1-t8. Three overlapping clusters are highlighted with different colors.
58/79
NMI Evaluation in Synthetic Data
Normalized Mutual Information (NMI) is commonly used to measure clustering quality. Given two clusterings X and Y, the NMI is defined below.
The NMI is computed in two steps:
1. First, find the pairs of clusters that are closest to each other in the two clusterings.
2. Second, average the mutual information between those pairs of clusters.
The higher the NMI value, the more similar the two clusterings are. If two clusterings X and Y are exactly the same, the NMI value is 1.
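For intuition, here is a sketch of the standard NMI for two non-overlapping labelings, I(X; Y) / sqrt(H(X) H(Y)). This is an assumption made for illustration and is not the overlapping variant with cluster matching described in the two steps; it does share the property that identical clusterings score 1.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a labeling (natural logarithm)."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def mutual_information(x, y):
    """Mutual information between two labelings of the same items."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum(
        (c / n) * math.log((c / n) / ((px[a] / n) * (py[b] / n)))
        for (a, b), c in pxy.items()
    )

def nmi(x, y):
    """Standard NMI for hard (non-overlapping) clusterings."""
    hx, hy = entropy(x), entropy(y)
    if hx == 0 or hy == 0:
        # Degenerate case: at least one clustering has a single cluster.
        return 1.0 if hx == hy else 0.0
    return mutual_information(x, y) / math.sqrt(hx * hy)

truth = [0, 0, 1, 1, 2, 2]
relabeled = [1, 1, 2, 2, 0, 0]                 # same grouping, different cluster ids
independent = nmi([0, 0, 1, 1], [0, 1, 0, 1])  # groupings that share no information
```

The score does not depend on how clusters are numbered, only on how the items are grouped.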
59/79
NMI and Number of Clusters
We generate another data set with 1,000 users and 1,000 tags and with different numbers of clusters, ranging from 5 to 50. The cluster density is set to 1, so that all users connect to all tags within each cluster.
Figure 3. NMI Performance w.r.t. Number of Clusters
60/79
NMI and Link Density
We also study how intra-cluster link density affects clustering in synthetic data sets. We created synthetic data sets (50 clusters, 1,000 users, and 1,000 tags) with different intra-cluster densities that range from 0.1 to 1.
Figure 4. NMI Performance w.r.t. Intra-cluster Link Density
61/79
See the Correlational Learning results in Figures 3 and 4.
62/79
SOCIAL MEDIA DATA AND FINDINGS
BlogCatalog is a social blog directory where bloggers can register their blogs under predefined categories. We crawled user names, user ids, their friends, blogs, the associated tags, and blog categories.
Delicious is a social bookmarking website that allows users to tag, manage, and share online resources (e.g., articles). For each resource, users are asked to provide several tags to summarize its main topic.
63/79
Interplay between Link Connection and Tag Sharing
There exist explicit and implicit relations between users. Examples of explicit relations are the friends or fans people choose to have. Examples of implicit relations are tag sharing, i.e., people who use the same tags.
Is there any correlation between the two different relations? What drives people to connect to others? Is it a random operation? We conducted a statistical analysis of user-user links and tag sharing.
In the first study, we separate user pairs that are connected from those that are not, then show the tag sharing probabilities for each.
64/79
Interplay between Link Connection and Tag Sharing (continued)
Figure 5 shows the tag sharing probabilities in the BlogCatalog and Delicious data sets. For the Delicious data, the friends network and the fans network are evaluated separately.
Figure 5. The x-axis represents the number of tags that two users share.
65/79
Interplay between Link Connection and Tag Sharing (continued)
Figures 6 and 7 show the probability that two users are connected given that they share tags, in BlogCatalog and Delicious, respectively. In Figure 6, the probability of a link between two users increases with the number of tags they share.
Figure 6. Link probability w.r.t. tag sharing in BlogCatalog
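The link probability as a function of shared tags can be estimated as sketched below. The data structures, the function name, and the toy users are assumptions made for illustration. The sketch enumerates all user pairs, which is only practical for small data sets.

```python
from collections import defaultdict
from itertools import combinations

def link_probability_by_shared_tags(user_tags, links):
    """Estimate P(two users are linked | they share k tags) for every k.

    user_tags: dict mapping each user to the set of tags they use.
    links: set of frozenset({u, v}) pairs for connected users.
    """
    pair_count = defaultdict(int)
    linked_count = defaultdict(int)
    for u, v in combinations(sorted(user_tags), 2):
        shared = len(user_tags[u] & user_tags[v])
        pair_count[shared] += 1
        if frozenset((u, v)) in links:
            linked_count[shared] += 1
    return {k: linked_count[k] / pair_count[k] for k in pair_count}

# Hypothetical toy data.
user_tags = {
    "a": {"x", "y"},
    "b": {"x", "y"},
    "c": {"x"},
    "d": {"z"},
}
links = {frozenset(("a", "b")), frozenset(("a", "c"))}

probabilities = link_probability_by_shared_tags(user_tags, links)
```

In this toy data the estimate grows with the number of shared tags, the same qualitative trend reported for BlogCatalog.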
66/79
Interplay between Link Connection and Tag Sharing (continued)
Figure 7. Link probability w.r.t. tag sharing in Delicious
67/79
Clustering Evaluation
The clustering evaluation consists of three studies:
1. First, cross-validation is performed to demonstrate the effectiveness of different clustering algorithms on the BlogCatalog data set.
2. Second, we study the correlation between user connectivity and co-occurrence in the extracted communities.
3. Third, concrete examples illustrate what the clusters are about.
68/79
1) Comparative Study: In BlogCatalog, categories for each blog are selected by the blog owner from a predefined list. With category information, procedures such as cross-validation (e.g., treating categories as class labels and cluster memberships as features) can be used to show the clustering quality.
Linear SVM is adopted in our experiments since it scales well to large data sets. As recommended by Tang et al., 1,000 communities are used in our experiments. We vary the fraction of training data from 10% to 90% and use the rest as test data.
This experiment is repeated 10 times, and the average Micro-F1 and Macro-F1 measures are reported.
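For reference, the two reported measures can be computed as in the following sketch, which uses the usual definitions: Macro-F1 averages the per-class F1 scores, while Micro-F1 pools the true positives, false positives, and false negatives over all classes first. The labels and predictions are hypothetical, and the classifier itself is not part of the sketch.

```python
def f1(tp, fp, fn):
    """F1 score from counts; defined as 0 when there is nothing to count."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def micro_macro_f1(true_labels, pred_labels, classes):
    """Micro-F1 and Macro-F1 for multi-label data.

    true_labels / pred_labels: lists of sets of class labels, one per item.
    """
    counts = {c: [0, 0, 0] for c in classes}  # [tp, fp, fn] per class
    for truth, pred in zip(true_labels, pred_labels):
        for c in classes:
            if c in pred and c in truth:
                counts[c][0] += 1
            elif c in pred:
                counts[c][1] += 1
            elif c in truth:
                counts[c][2] += 1
    macro = sum(f1(*counts[c]) for c in classes) / len(classes)
    total_tp = sum(v[0] for v in counts.values())
    total_fp = sum(v[1] for v in counts.values())
    total_fn = sum(v[2] for v in counts.values())
    micro = f1(total_tp, total_fp, total_fn)
    return micro, macro

# Hypothetical blog categories and predictions.
classes = ["health", "personal"]
truth = [{"health"}, {"personal"}, {"health", "personal"}, {"personal"}]
pred = [{"health"}, {"personal"}, {"personal"}, {"health"}]

micro, macro = micro_macro_f1(truth, pred, classes)
```

Micro-F1 is dominated by frequent categories, whereas Macro-F1 gives every category the same weight, which is why both are usually reported together.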
69/79
Table II shows five different clustering methods and their prediction performance. In this table, the fourth algorithm, EdgeCluster, uses the user-user network rather than the user-tag network. Dhillon's co-clustering algorithm is based on Singular Value Decomposition (SVD) of the normalized user-tag matrix.
As shown in Table II, Correlational Learning consistently performs better, especially when the training set is small. According to Table II, normalization does not improve performance, which suggests that normalization should be applied cautiously. Dhillon's co-clustering method, which can only deal with non-overlapping clustering, does not perform well compared to the other methods.
70/79
2) Connectivity Study: We study the correlation between user co-occurrence in the extracted communities and the actual social connections between them. We also study the connectivity between users who are in the top similar list. 1,000 overlapping communities are extracted by Correlational Learning.
71/79
We study the disconnectivity between the users who are most similar. Figure 8 shows that the probability of being disconnected is higher than 96% and 99% in BlogCatalog and Delicious, respectively, which means that the majority of homogeneous users are not connected in the actual social networks.
For example, users marama6 and ameer1577 are both interested in the online game "World of Warcraft".
Figure 8. Probability of being disconnected between top similar users
72/79
3) Illustrative Examples: Health is the second largest category in BlogCatalog (the largest is personal), a hot topic that attracts a lot of attention.
73/79
The largest cluster about health obtained by Correlational Learning is cluster-health, with 127 users and 102 tags. The cluster that has the maximum user overlap with cluster-health is cluster-nutrition, with 83 users and 25 tags. Their tag clouds are shown in Figures 10 and 11. The two clusters have 18 users and 3 tags (health, nutrition, and weight loss) in common. Both clusters are related to health, but the first has an emphasis on physical health, highlighted by the tags arthritis, drugs, food, and dentist, and the second is more about nutrition.
74/79
The top 102 tags of category-health are compared to the tags of cluster-health, and the top 25 tags of category-health to those of cluster-nutrition. The numbers of shared tags are 16 for cluster-health and 9 for cluster-nutrition.
75/79
In addition, we aggregate the tags of the users in cluster-health and present the 102 most frequent tags in Figure 12. Comparing these tags with those of cluster-health, 40 tags are in common. Many tags, such as environment, humor, and jokes, are not present in the tag cloud of cluster-health, which suggests that these users actually have other interests besides health. A similar pattern is observed for cluster-nutrition.
76/79
CONCLUSIONS AND FUTURE WORK
We proposed a framework to study the overlapping clustering of users and tags in online social media, which helps to understand the major concerns within the groups. Experimental results on synthetic data reveal that Correlational Learning is very effective in recovering the overlapping cluster structures, even when the intra-cluster density is low.
This study suggests more interesting problems that are worth exploring further. Formulating the co-clustering problem as an objective function and maximizing it is one direction to work on.
78/79