Community Detection and Behavior Study for Social Computing

download Community Detection and Behavior Study for Social Computing

If you can't read please download the document

description

Community Detection and Behavior Study for Social Computing. Huan Liu + , Lei Tang + , and Nitin Agarwal * + Arizona State University * University of Arkansas at Little Rock. Updated slides available at http://www.public.asu.edu/~ltang9/ - PowerPoint PPT Presentation

Transcript of Community Detection and Behavior Study for Social Computing

Community Detection and Behavior Study for Social Computing

Community Detection and Behavior Study for Social ComputingHuan Liu+, Lei Tang+, and Nitin Agarwal*+Arizona State University*University of Arkansas at Little Rock

Updated slides available at http://www.public.asu.edu/~ltang9/ http://www.public.asu.edu/~huanliu/1AcknowledgementsWe would like to express our sincere thanks to Jianping Zhang, John J. Salerno, Sun-Ki Chai, Xufei Wang, Sai Motoru and Reza Zafarani for collaboration, discussion, and valuable comments.This work derives from the projects, in part, sponsored by AFOSR and ONR grants.Some materials presented here can be found in the following book chapters and references section of this tutorial:Lei Tang and Huan Liu, Graph Mining Applications to Social Network Analysis, in Managing and Mining Graph Data (forthcoming)Lei Tang and Huan Liu, Understanding Group Structures and Properties in Social Media, in Link Mining: Models, Algorithms and Applications (forthcoming)If you wish to use the ppt version of the slides, please contact (or email) us. The ppt version contains more comprehensive materials with additional information and notes and many animations. OutlineSocial Media Data Mining TasksEvaluation

Principles of Community Detection Communities in Heterogeneous Networks Evaluation Methodology for Community Detection

Behavior Prediction via Social DimensionsIdentifying Influential Bloggers in a CommunityA related tutorial on Blogosphere

Social Computing definitionGeneral: Social computing is concerned with the study of social behavior and social context based on computational systems.Narrower definition: Data Mining with Social Media3Participating Web and Social Media

Traditional Media

Broadcast Media: One-to-ManyCommunication Media: One-to-OneSocial Media: Many-to-Many

There are various of social media: like social network , blogs, wiki, forums content sharing. Most of them are an open platform so users can connect with each other, generate content. These web 2.0 applications gained a lot of online traffic recently. They provide rich information about online users, also new tasks and challenges.

6Traditional vs. Social MediaTraditional MediaBroadcast mediaOne-to-ManyRadioTelevisionNewspapersMoviesCommunication MediaOne-to-OnePhone callsFaxesInstant MessageSocial MediaMySpaceFacebookTwitterBlogMany-to-ManyWeb 2.0 Services (examples)BlogsBlogspotWordpressWikisWikipediaWikiversitySocial Networking SitesFacebookMyspaceOrkutDigital media sharing websitesYoutubeFlickrSocial TaggingDel.icio.usOthersTwitterYelp

WordPress Express yourself

Twitter is a service for friends, family, and coworkers tocommunicate and stay connected through the exchange of quick, frequent answers to one simple question: What are you doing?

Yelp is the fun and easy way to find, review and talk about whats great8Characteristics of Social MediaEveryone can be a media outletDisappearing of communications barrierRich User InteractionUser-Generated ContentsUser Enriched ContentsUser developed widgetsCollaborative environmentCollective WisdomLong Tail

Broadcast MediaFilter, then PublishSocial MediaPublish, then FilterTop 20 Most Visited WebsitesInternet traffic report by Alexa on July 29th 2008

40% of the top 20 websites are Web 2.0 sites1Yahoo!11Orkut2Google12RapidShare3YouTube13Baidu.com4Windows Live14Microsoft Corporation5Microsoft Network15Google India6Myspace16Google Germany7Wikipedia17QQ.Com8Facebook18EBay9Blogger19Hi510Yahoo! Japan20Google FranceRoughly 60 million American visit Wikipedia every month.

Since Michael Jackson died on June 25, the wikipedia article about him has been viewed more than 30 million times. With 6 million of those in the first 24 hours.10Top 20 Most Visited WebsitesInternet traffic report by Alexa on August 27th, 2009

40% of the top 20 websites are social media sites1Google11MySpace2Yahoo!12Google India3Facebook13Google Germany4YouTube14Twitter5Windows Live15QQ.Com6Wikipedia16RapidShare7Blogger17Microsoft Corporation 8Microsoft Network (MSN)18Google France9Baidu.com19WordPress.com 10Yahoo! Japan20Google UKRoughly 60 million American visit Wikipedia every month.

Since Michael Jackson died on June 25, the wikipedia article about him has been viewed more than 30 million times. With 6 million of those in the first 24 hours.

Some web sites are becoming social (Baidu.com, QQ.com, 11Social Medias Important Role

"social networks will complement, and may replace, some government functions,"

http://www.gartner.com/it/page.jsp?id=78421212Social Networks and Data Mining

Social NetworksA social structure made of nodes (individuals or organizations) that are related to each other by various interdependencies like friendship, kinship, etc.Graphical representationNodes = membersEdges = relationshipsVarious realizationsSocial bookmarking (Del.icio.us)Friendship networks (facebook, myspace)BlogosphereMedia Sharing (Flickr, Youtube)Folksonomies

Folksonomy (also known as collaborative tagging, social classification, social indexing, and social tagging) is the practice and method of collaboratively creating and managing tags to annotate and categorizecontent.14Sociomatrix12345678910111213101110001100002100010000000031000000000000

Social networks can also be represented in matrix formSocial Computing and Data MiningSocial computing is concerned with the study of social behavior and social context based on computational systems.Data Mining Related TasksCentrality AnalysisCommunity DetectionClassificationLink PredictionViral MarketingNetwork Modeling

Centrality Analysis/Influence StudyIdentify the most important actors in a social networkGiven: a social networkOutput: a list of top-ranking nodes

Top 5 important nodes: 6, 1, 8, 5, 10(Nodes resized by Importance)

Based on betweenness: Vertices that occur on many shortest paths between other vertices have higher betweenness than those that do not.In other words, if taking out node 6, the average distance between any two nodes of the remaining network can increase. 17Community DetectionA community is a set of nodes between which the interactions are (relatively) frequenta.k.a. group, subgroup, module, clusterCommunity detectiona.k.a. grouping, clustering, finding cohesive subgroupsGiven: a social networkOutput: community membership of (some) actors ApplicationsUnderstanding the interactions between peopleVisualizing and navigating huge networksForming the basis for other tasks such as data mining

Visualization after Grouping

(Nodes colored by Community Membership)4 Groups:{1,2,3,5}{4,8,10,12}{6,7,11}{9,13}

Modularity19ClassificationUser Preference or Behavior can be represented as class labelsWhether or not clicking on an adWhether or not interested in certain topicsSubscribed to certain political viewsLike/Dislike a productGivenA social networkLabels of some actors in the networkOutputLabels of remaining actors in the network

20Visualization after Prediction

: Smoking: Non-Smoking: ? UnknownPredictions6: Non-Smoking7: Non-Smoking8: Smoking9: Non-Smoking10: SmokingLink PredictionGiven a social network, predict which nodes are likely to get connectedOutput a list of (ranked) pairs of nodesExample: Friend recommendation in Facebook

Link Prediction(2, 3)(4, 12)(5, 7)(7, 13)Viral Marketing/Outbreak DetectionUsers have different social capital (or network values) within a social network, hence, how can one make best use of this information?Viral Marketing: find out a set of users to provide coupons and promotions to influence other people in the network so my benefit is maximizedOutbreak Detection: monitor a set of nodes that can help detect outbreaks or interrupt the infection spreading (e.g., H1N1 flu)Goal: given a limited budget, how to maximize the overall benefit?

23An Example of Viral MarketingFind the coverage of the whole network of nodes with the minimum number of nodesHow to realize it an exampleBasic Greedy Selection: Select the node that maximizes the utility, remove the node and then repeat

Select Node 1Select Node 8Select Node 7Node 7 is not a node with high centrality!24Network ModelingLarge Networks demonstrate statistical patterns:Small-world effect (e.g., 6 degrees of separation)Power-law distribution (a.k.a. scale-free distribution)Community structure (high clustering coefficient)Model the network dynamicsFind a mechanism such that the statistical patterns observed in large-scale networks can be reproduced.Examples: random graph, preferential attachment processUsed for simulation to understand network propertiesThomas Shellings famous simulation: What could cause the segregation of white and black peopleNetwork robustness under attack He wondered if racial segregation might, in principle, have absolutely nothing to do with racism. First experiment racism can cause segregationSecond experiment racially tolerant people might prefer to avoid being part of extreme minority (< 30%)25Comparing Network Models

observations over various real-word large-scale networksoutcome of a network model(Figures borrowed from Emergence of Scaling in Random Networks)Last figure is a model of similar size to A.26Social Computing ApplicationsAdvertizing via Social Networking Behavior Modeling and PredictionEpidemic StudyCollaborative FilteringCrowd Mood ReaderCultural Trend MonitoringVisualizationHealth 2.0

Mining the Web for Feelings, Not Facts (an August NY Times article)27General Evaluation MeasuresBasic Evaluation and MetricsAssessment is an essential stepComparing with some ground truth if availableObviously, various tasks may require different ways of performance evaluationRankingClusteringClassification An understanding of these concepts will help us to develop more pertinent evaluation methods.Measuring a Ranked ListNormalized Discounted Cumulative Gain (NDCG)Measuring relevance of returned search resultMulti levels of relevance (r): irrelevant (0), borderline (1), relevant (2)Each relevant document contributes some gain to be cumulatedGain from low ranked documents is discountedNormalized by the maximum DCG

http://sifaka.cs.uiuc.edu/course/410s08/410notes.pdfWhy NDCG?Sensitive to the position of highest rated pageLog-discounting of resultsNormalized for different lengths lists

30NDCG - ExampleiGround TruthRanking Function1Ranking Function2Document OrderriDocument OrderriDocument Orderri1d42d32d322d32d42d213d21d21d424d10d10d10NDCGGT=1.00NDCGRF1=1.00NDCGRF2=0.9203

4 documents: d1, d2, d3, d4Comparing Two Ranked ListsRank correlationSpearmans rank correlation coefficient

Example = 1-(6*194/10*(102-1)) = -0.175

XiYirank xirank yididi28601100972026-416992838-5251002747-3910150510-5251032969-3910677341611017853911269274911312104636d is the difference or (x y)http://en.wikipedia.org/wiki/Spearman's_rank_correlation_coefficient32Concordance between a PairRank correlation[-1,1]: perfect agreement=1, perfect disagreement=-1Kendall tau rankcorrelation coefficient

Example

PersonABCDEFGHRank by Height12345678Rank by Weight34125786P = 5 + 4 + 5 + 4 + 3 + 1 + 0 + 0 = 22 = (4*22/8*7 )-1= (88/56)-1 = 0.57http://www.itl.nist.gov/div898/software/dataplot/refman2/auxillar/kendell.htmA term in P is the number of items that remain in agreement in the two comparing lists33Measuring a Classification ResultConfusion Matrix

Measures:

Prediction (+)Prediction (-)Truth (+)True Positive (tp)False Positive (fn)Truth (-)False Positive (fp)True Negative (tn)

+-Predicted +

F-measure Example

: Smoking: Non-Smoking: ? UnknownPredictions6: Non-Smoking7: Non-Smoking8: Smoking9: Non-Smoking10: SmokingTruth6: Smoking7: Non-Smoking8: Smoking9: Smoking10: SmokingTruth (+)Truth (-)Prediction (+) 2 (node 8, 10)0Prediction (-)2 (node 6, 9)1 (node 7)Accuracy = (2+1)/ 5 = 60%Precision = 2/(2+0)= 100%Recall = 2/(2+2) = 50%F-measure= 2*100% * 50% / (100% + 50%) = 2/3 Precision among predicted true, how many are really, predicted true (true positive)Recall among all real true, how many are really, predicted true (true positive)35Measuring a Clustering ResultThe number of communities after grouping can be different from the ground truthNo clear community correspondence between clustering result and the ground truth Normalized Mutual Information can be usedGround Truth1, 2, 33, 4, 51, 42, 53, 6Clustering ResultHow to measure the clustering quality?Normalized mutual information considers all the possible community matching between the ground truth and the clustering result36Normalized Mutual InformationEntropy: the information contained in a distribution

Mutual Information: the shared information between two distributions

Normalized Mutual Information (between 0 and 1)

Consider a partition as a distribution (probability of one node falling into one community), we can compute the matching between two clusterings

We normalize MI so that we can compare more than two clusterings, e.g., clusterings A, B, C37NMI

Pi_a, Pi_b denote different partitions. n_h^a is the number of nodes in Partition a assigned to h-th community;n_l^b is the number of nodes in Partition b assigned to l-th community;

n_h, l is the number of nodes assigned to h-th community in partitition a, and l-th community in partition b.

k^{(a)} is the number of communities in partition a, k^{(b)} is the number of communities in partition b. 38NMI-ExamplePartition a: [1, 1, 1, 2, 2, 2] Partition b: [1, 2, 1, 3, 3, 3]

1, 2, 34, 5, 61, 324, 5,6

h=13h=23

l=12l=21l=33

l=1l=2l=3h=1210h=2003

=0.8278OutlineSocial Media Data Mining TasksEvaluation

Principles of Community Detection Communities in Heterogeneous Networks Evaluation Methodology for Community Detection

Behavior Prediction via Social DimensionsIdentifying Influential Bloggers in a CommunityA related tutorial on Blogosphere

Social Computing definitionGeneral: Social computing is concerned with the study of social behavior and social context based on computational systems.Narrower definition: Data Mining with Social Media40Principles of Community Detection

41CommunitiesCommunity: subsets of actors among whom there are relatively strong, direct, intense, frequent or positive ties.-- Wasserman and Faust, Social Network Analysis, Methods and ApplicationsCommunity is a set of actors interacting with each other frequently e.g. people attending this conferenceA set of people without interaction is NOT a community e.g. people waiting for a bus at station but dont talk to each other

People form communities in Social Media

42Example of Communities

Communities from FacebookCommunities from FlickrWhy Communities in Social Media?Human beings are socialPart of Interactions in social media is a glimpse of the physical world People are connected to friends, relatives, and colleagues in the real world as well as onlineEasy-to-use social media allows people to extend their social life in unprecedented waysDifficult to meet friends in the physical world, but much easier to find friend online with similar interests

Community DetectionCommunity Detection: formalize the strong social groups based on the social network properties Some social media sites allow people to join groups, is it necessary to extract groups based on network topology?Not all sites provide community platformNot all people join groupsNetwork interaction provides rich information about the relationship between usersGroups are implicitly formedCan complement other kinds of informationHelp network visualization and navigationProvide basic information for other tasksGroups are not active: the group members seldom talk to each other

Explicitly vs. implicitly formed groups

45Subjectivity of Community Definition

Each component is a communityA densely-knit community Definition of a community can be subjective.No objective definition for a communityVisualization might help, but only for small networksReal-world networks tend to be noisyNeed a proper criteria

46Taxonomy of Community Criteria Criteria vary depending on the tasksRoughly, community detection methods can be divided into 4 categories (not exclusive): Node-Centric CommunityEach node in a group satisfies certain properties Group-Centric CommunityConsider the connections within a group as a whole. The group has to satisfy certain properties without zooming into node-levelNetwork-Centric CommunityPartition the whole network into several disjoint setsHierarchy-Centric Community Construct a hierarchical structure of communities

47Node-Centric Community DetectionNode-Centric Community DetectionNodes satisfy different propertiesComplete Mutuality cliquesReachability of membersk-clique, k-clan, k-clubNodal degrees k-plex, k-coreRelative frequency of Within-Outside TiesLS sets, Lambda setsCommonly used in traditional social network analysisHere, we discuss some representative onesComplete Mutuality: CliqueA maximal complete subgraph of three or more nodes all of which are adjacent to each otherNP-hard to find the maximal cliqueRecursive pruning: To find a clique of size k, remove those nodes with less than k-1 degrees

Very strict definition, unstableNormally use cliques as a core or seed to explore larger communities

GeodesicReachability is calibrated by the Geodesic distanceGeodesic: a shortest path between two nodes (12 and 6)Two paths: 12-4-1-2-5-6, 12-10-612-10-6 is a geodesicGeodesic distance: #hops in geodesic between two nodese.g., d(12, 6) = 2, d(3, 11)=5Diameter: the maximal geodesic distance for any 2 nodes in a network#hops of the longest shortest path

Diameter = 56 degrees of separation is the average distance, not the maximum one.51Reachability: k-clique, k-clubAny node in a group should be reachable in k hopsk-clique: a maximal subgraph in which the largest geodesic distance between any nodes Outlinks > Blog post length

OutlineSocial Media Data Mining TasksEvaluation

Principles of Community Detection Communities in Heterogeneous Networks Evaluation Methodology for Community Detection

Behavior Prediction via Social DimensionsIdentifying Influential Bloggers in a CommunityA related tutorial on Blogosphere

Social Computing definitionGeneral: Social computing is concerned with the study of social behavior and social context based on computational systems.Narrower definition: Data Mining with Social Media137References: Social Computing Tang, L.; Liu, H.; Zhang, J. &Nazeri, Z. (2008), Community evolution in dynamic multi-mode networks, in 'KDD '08: Proceeding of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining', ACM, New York, NY, USA, pp. 677--685.Tang, L.; Liu, H.; Zhang, J.; Agarwal, N. & Salerno, J. J. (2008), 'Topic taxonomy adaptation for group profiling', ACM Trans. Knowl. Discov. Data1(4), 1--28.Liben-Nowell, D. & Kleinberg, J. (2007), 'The link-prediction problem for social networks', J. Am. Soc. Inf. Sci. Technol.58(7), 1019--1031.Newman, M. (2005), 'Power laws, Pareto distributions and Zipf's law', Contemporary physics46(5), 323--352.Richardson, M. &Domingos, P. (2002), Mining knowledge-sharing sites for viral marketing, in 'KDD', pp. 61-70.Barabsi, A.-L. & Albert, R. (1999), 'Emergence of Scaling in Random Networks', Science286(5439), 509-512.Travers, J. &Milgram, S. (1969), 'An Experimental Study of the Small World Problem', Sociometry32(4), 425-443.

Return to Menu

References: Community DetectionFlake, G. W.; Lawrence, S. & Giles, C. L. (2000), Efficient identification of Web communities, in 'KDD '00: Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining', ACM, New York, NY, USA, pp. 150--160.Fortunato, S. &Barthelemy, M. (2007), 'Resolution limit in community detection', PNAS104(1), 36--41.Gibson, D.; Kumar, R. & Tomkins, A. (2005), Discovering large dense subgraphs in massive graphs, in 'VLDB '05: Proceedings of the 31st international conference on Very large data bases', VLDB Endowment, , pp. 721--732.Handcock, M. S.; Raftery, A. E. & Tantrum, J. M. (2007), 'Model-based clustering for social networks', Journal Of The Royal Statistical Society Series A127(2), 301-354.Hoff, P. D. & Adrian E. Raftery, M. S. H. (2002), 'Latent Space Approaches to Social Network Analysis', Journal of the American Statistical Association97(460), 1090-1098von Luxburg, U. (2007), 'A tutorial on spectral clustering', Statistics and Computing17(4), 395--416.

Return to MenuReferences: Community DetectionNewman, M. (2006), 'Modularity and community structure in networks', PNAS103(23), 8577-8582.Newman, M. (2006), 'Finding community structure in networks using the eigenvectors of matrices', Physical Review E (Statistical, Nonlinear, and Soft Matter Physics)74(3).Newman, M. & Girvan, M. (2004), 'Finding and evaluating community structure in networks', Physical Review E69, 026113.Nowicki, K. &Snijders, T. A. B. (2001), 'Estimation and Prediction for Stochastic Blockstructures', Journal of the American Statistical Association96(455), 1077-1087.Sarkar, P. & Moore, A. W. (2005), 'Dynamic social network analysis using latent space models', SIGKDD Explor. Newsl.7(2), 31--40.Shi, J. &Malik, J. (1997), Normalized Cuts and Image Segmentation, in 'CVPR '97: Proceedings of the 1997 Conference on Computer Vision and Pattern Recognition (CVPR '97)', IEEE Computer Society, Washington, DC, USA, pp. 731.White, S. & Smyth, P. (2005), A spectral Clustering Approaches To Finding Communities in Graphs, in 'SDM'.

Return to MenuThank You!

Please feel free to contact Lei Tang ([email protected]) if you have any questions!