arXiv:1201.4899v2 [cs.DS] 1 Mar 2012

Finding Endogenously Formed Communities

Maria-Florina Balcan∗ Christian Borgs† Mark Braverman‡ Jennifer Chayes§

Shang-Hua Teng¶

March 2, 2012

Abstract

A central problem in e-commerce is determining overlapping communities (clusters) among individuals or objects in the absence of external identification or tagging. We address this problem by introducing a framework that captures the notion of communities or clusters determined by the relative affinities among their members. To this end we define what we call an affinity system, which is a set of elements, each with a vector characterizing its preference for all other elements in the set. We define a natural notion of (potentially overlapping) communities in an affinity system, in which the members of a given community collectively prefer each other to anyone else outside the community. Thus these communities are endogenously formed in the affinity system and are “self-determined” or “self-certified” by its members.

We provide a tight polynomial bound on the number of self-determined communities as a function of the robustness of the community. We present a polynomial-time algorithm for enumerating these communities. Moreover, we obtain a local algorithm with a strong stochastic performance guarantee that can find a community in time nearly linear in the size of the community (as opposed to the size of the network).

Social networks and social interactions fit particularly naturally within the affinity system framework: if we can appropriately extract the affinities from the relatively sparse yet rich information from social networks and social interactions, our analysis then yields a set of efficient algorithms for enumerating self-determined communities in social networks. In the context of social networks we also connect our analysis with results about $(\alpha, \beta)$-clusters introduced by Mishra, Schreiber, Stanton, and Tarjan [16, 17]. In contrast with the polynomial bound we prove on the number of communities in the affinity system model, we show that there exists a family of networks with a superpolynomial number of $(\alpha, \beta)$-clusters.

1 Introduction

Affinity Systems The problem of identifying endogenously¹ formed overlapping communities or clusters arises in many contexts within e-commerce: finding overlapping communities in a social network, clustering retail products using collaborative filtering, clustering documents using citation information, classifying videos using viewing logs, etc. In such settings one needs to cluster the set of objects into meaningful, potentially overlapping subsets by only using information about relations between the objects. In this paper we develop the notion of an affinity system to model these scenarios.

∗School of Computer Science, College of Computing, Georgia Institute of Technology, Atlanta, Georgia.
†Microsoft Research, New England, Cambridge, MA.
‡Princeton University and University of Toronto.
§Microsoft Research, New England, Cambridge, MA.
¶Computer Science Department, University of Southern California.
¹endogenous: growing or developing from within; originating within.

An affinity system is a collection of elements with a set of “preferences” each of these elements has over other elements within the system. These preferences may be expressed as a vector of rankings, or, more generally, as a vector of non-negative weights representing affinities. For example, when clustering videos, affinities may represent the likelihood of the videos to be co-watched, with videos that are co-watched more often “ranking” each other higher. When clustering documents, a document will “prefer” documents it cites over documents it doesn’t.

Perhaps the most natural application of affinity systems is to the study of social networks. Social interaction is often determined by affinities among the members. For example, in daily life, we often stay more in touch with people we like more. When we go to a conference, we often hang out more with people with whom we share more interests. Therefore, these social interactions, and their manifestations as online social networks, fit well within the affinity system paradigm.

Endogenously Formed Communities in Affinity Systems A central question concerning groups of individuals, documents, products, etc., is how to determine communities, or overlapping clusters, that capture the coherence among their members. For example, in the context of retail products discussed above, it may be useful to automatically “tag” the products with multiple categories for subsequent personalized marketing. In the context of professional networks, a person may belong to multiple explicit or implicit communities; for example, a scientist may simultaneously belong to the community of Economists and the community of Computer Scientists. The question of finding overlapping communities is closely related to the very well studied question of clustering [10], but is much more general, since now elements may (and will) belong to multiple communities.

In this paper we formalize a natural notion of self-determined community and develop efficient algorithms to identify overlapping communities of this type as well as general bounds on the number of such communities. Self-determined communities correspond to subsets that collectively prefer each other more than they prefer those outside the subset, where preference is defined by the rankings or weights of the affinity system. These communities are endogenously formed in the affinity system. What is particularly nice about this formulation is that we do not require that the subsets be of pre-specified sizes. For example, a solution of the flexible capacity roommate problem would group together people who prefer living with each other to living with anyone else in another room. Switching to the context of social and professional networks, an academic community can be viewed as a group of scholars who appreciate the work of others in the community more than the work of people outside their community. In all these cases, the overlapping communities or clusters are self-certified or self-determined.

More formally, we study the mathematical structure of self-determined communities in an affinity system and design efficient algorithms for discovering them. In our most basic model, we have $n$ members $V = \{1, \dots, n\}$ in an affinity system, and we assume each member $i$ states a strict ranking $\pi_i$ of all members in the order of her preferences. To evaluate whether a subset $S$ of size $|S| = k$ is a good community, imagine that each member $s \in S$ casts a vote for each of its $k$ most preferred members $\pi_s(1 : k)$. The number of votes that member $i$ receives, $\phi_S(i) = |\{s \in S \mid i \in \pi_s(1 : |S|)\}|$, is the collective preference given by $S$. We say $S$ is self-determined if everyone in $S$ receives more votes from $S$ than everyone outside $S$.

Different self-determined communities may have different degrees of coherence or robustness depending on both the fraction of votes received by the community members as well as the gap between the fraction of votes received by the community members and the non-community members. To capture this, we say $S$ is a $(\theta, \alpha, \beta)$ self-determined community, for $0 \le \beta < \alpha \le 1$ and $\theta > 0$, if

• each member $s \in S$ casts a vote for each of its $\theta|S|$ most preferred members $\pi_s(1 : \theta|S|)$.

• for each $i \in S$, the amount of vote $i$ receives, $\phi^{(\theta)}_S(i) = |\{s \in S \mid i \in \pi_s(1 : \theta|S|)\}|$, is at least $\alpha|S|$.

• for each $j \notin S$, the amount of vote $j$ receives, $\phi^{(\theta)}_S(j) = |\{s \in S \mid j \in \pi_s(1 : \theta|S|)\}|$, is at most $\beta|S|$.

We start by analyzing how many communities can exist in an affinity system. Interestingly, we show that for constants $\alpha, \beta, \theta$ we have a polynomial bound of $n^{O(\log(1/\alpha)/\alpha)}$ on the number of $(\theta, \alpha, \beta)$-self-determined communities. Our analysis, using probabilistic methods, also yields a polynomial-time algorithm for enumerating these communities. Moreover, we show that our bound is nearly tight, by exhibiting an affinity system with $n^{\Omega(1/\alpha)}$ $(\theta, \alpha, \beta)$-self-determined communities.

We then present a local community finding algorithm that is very efficient for an interesting range of parameters. This algorithm, when given robustness parameters $\theta, \alpha, \beta$ and a member $v \in V$, either returns a $(\theta, \alpha, \beta)$-self-determined community of size $t$ in time $O(f(\alpha, \beta, \theta) \cdot t \log t)$ or an empty set. The algorithm satisfies the following performance guarantee: if $\alpha > 1/2$ and $v$ is chosen uniformly at random from a $(\theta, \alpha, \beta)$-self-determined community $S$, then with probability $\Omega(2\alpha - 1)$ our local algorithm will successfully recover $S$, and it does so in time that depends (nearly linearly) only on $|S|$ and not on the size of the entire affinity system. As a consequence of this analysis, we can show that in the (natural) case when $\alpha > 1/2$ we obtain a near-linear algorithm for finding all self-determined communities, substantially improving on the polynomial-time guarantee discussed above. Quasi-linear local algorithms are particularly important in the context of studying internet-scale networks, where even quadratic-time algorithms are not feasible, and where one sometimes does not have access to the entire network but only to a local portion of it. The quasi-linear algorithm is one of our main technical contributions, as its techniques can potentially be used to convert other polynomial cluster-detection algorithms into local quasi-linear algorithms, at least in the average case.

We also study multi-facet affinity systems where each member may have a number of different rankings of other members. For example, member $i$ may have two rankings $\pi_{i,\mathrm{fun}}$ and $\pi_{i,\mathrm{science}}$, where the first ranks members by how much fun $i$ thinks they are and the second ranks them according to academic affinity. In this context, we say $S$ is a self-determined community if there exists a vector of choices of rankings (in this case, in $\{\mathrm{fun}, \mathrm{science}\}^{|S|}$) such that if members vote according to their associated choice, the resulting votes self-certify $S$. We prove that if each member has a constant number of rankings, all our results can be extended, even though there could be an exponential number of combinations of rankings.

Our results can be extended to weighted affinity systems where the affinities of each member are given by a numerical weighting rather than just an ordinal ranking. For example, member $i$ may give her most preferred member weight $1$, the next two preferred members weight $0.7$, the next one weight $0.5$, and so on. A weighted affinity system can be expressed as $A = \{V, a_1, \dots, a_n\}$, where $a_i$ is an $n$-dimensional vector $a_i = (a_{i,1}, \dots, a_{i,n})$ and $0 \le a_{i,j} \le 1$ specifies the degree of affinity that $i$ has for $j$. One can naturally define $(\theta, \alpha, \beta)$-self-determined communities for weighted affinity systems. The only requirement is that members are only allowed to cast votes up to a total weight of $\theta t$ when voting for a community of size $t$, while respecting the affinity system. We show that all our bounds and algorithmic results extend to weighted affinity systems with only a slight loss in the parameters.


Endogenously Formed Communities in Social Networks Our general formulation enables us to shed light on the challenging task of defining and finding overlapping communities in social networks [19, 17, 16, 14, 15]. Typically, a social network can be viewed as a graph $G = (V, E)$, where the edges could be either undirected (e.g., the Facebook social network determined by friend lists) or directed (e.g., the Twitter network). An edge could be unweighted or weighted (e.g., the Skype phone-call networks or the Facebook network based on the number of times that one person writes on the wall of others).

It turns out that a social network can be realized as a projection of an affinity system. Indeed, although our affinity systems are typically dense, their projections as social networks can be very sparse, as are many observed social networks. We can think of our observed social network interactions as being induced (in various ways) by the underlying latent set of affinities. To be precise, given a social network $G = \{V, W, w\}$ with weights $w = (w_{ij})$, we would like to recover the communities in the original affinity system. A natural way to do this is to lift the social network back to an affinity system $A = \{V, a_1, \dots, a_n\}$ and then to solve the problem in the lifted system. For example, several natural approaches for lifting, based on different beliefs about how the social network may have emerged from an underlying set of affinities, include:

1. Direct Lifting: One can directly lift to an affinity system by defining $a_{i,j} = w_{i,j}$ if $(i, j) \in E$, otherwise $a_{i,j} = 0$ (we assume WLOG that $w_{i,j} \in [0, 1]$).

2. Shortest Path Lifting: If $G = (V, E)$ is an unweighted social network, and the shortest path distance from $i$ to $j$ is $d_{i,j}$, one may define $a_{i,j} = 1/d_{i,j}$. The shortest path lifting can be extended to weighted cases by appropriate normalization.

3. Personal PageRank Lifting: Let $p_i$ be the personal PageRank vector [2] of vertex $i$; we define $a_{i,j} = p_{i,j}/\max(p_i)$.

4. Effective Resistance Lifting: Let $r_{i,j}$ be the effective resistance from $i$ to $j$, viewing $G$ as a network of resistors and using $1/w(e)$ as the resistance of $e \in E$ [9]; we define $a_{i,j} = \min_k (r_{i,k}/r_{i,j})$.

Each style of lifting corresponds to a particular belief on how this social network may have emerged from a latent underlying affinity system. For instance, Direct Lifting corresponds to the belief that a social network $G = (V, E)$, such as the Twitter network, arose from a latent affinity system $A' = (V, a'_1, \dots, a'_n)$ by a process in which each member $i$ connects to the $d_i$ topmost elements according to the affinity system $A'$. In other words, $i$ follows $j$, i.e., $(i, j)$ is a directed edge in this social network, if $j$ is among the $d_i$ topmost elements of $i$ according to $A'$. Similarly, one can think of Shortest Path Lifting as corresponding to the belief that the social network serves as an approximate spanner of the underlying affinity system [18], and Effective Resistance Lifting corresponds to the belief that a social network is approximately based on some spectral sparsification of those underlying affinities [20].
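The first two liftings are simple enough to sketch in code. The following is a minimal illustration, not the authors' implementation: the function names, the edge-list input format, and the conventions that $a_{i,i} = 0$ and that unreachable pairs get affinity $0$ are all assumptions made here for concreteness.

```python
from collections import deque

def direct_lifting(n, weights):
    """Direct lifting: a[i][j] = w_ij if (i, j) is an edge, else 0.

    `weights` maps directed edges (i, j) to weights assumed to lie in [0, 1].
    """
    a = [[0.0] * n for _ in range(n)]
    for (i, j), w in weights.items():
        a[i][j] = w
    return a

def shortest_path_lifting(n, edges):
    """Shortest path lifting for an unweighted directed graph: a[i][j] = 1/d(i, j).

    Assumptions of this sketch: a[i][i] = 0 and a[i][j] = 0 when j is
    unreachable from i (the text does not fix either convention).
    """
    adj = [[] for _ in range(n)]
    for i, j in edges:
        adj[i].append(j)
    a = [[0.0] * n for _ in range(n)]
    for src in range(n):
        dist = {src: 0}
        queue = deque([src])
        while queue:  # breadth-first search from src
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        for v, d in dist.items():
            if d > 0:
                a[src][v] = 1.0 / d
    return a
```

On the directed path 0 → 1 → 2, for instance, the shortest path lifting gives member 0 affinity 1 for member 1 and 1/2 for member 2, while member 2 has affinity 0 for everyone.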

Given a social network, once we derive a corresponding affinity system $A$, we may use our notion of self-determined community and apply our algorithms and analysis to obtain communities in the original network. From our analysis for affinity systems, we immediately obtain that there is a polynomial number of such communities in a social network, and they can be enumerated in polynomial time.

We note that while the input social network is potentially very sparse, appropriate lifting procedures can produce an affinity system better reflecting the true relationships between entities. Moreover, many of these liftings can be performed locally, allowing our local algorithm to determine meaningful communities especially efficiently.


We also note that our study of multi-facet affinity systems allows us to model and analyze communities in more complex social networks, such as Google+ with circles, which enable its users to share different things with different circles of people. This extension may also enable us to model interdisciplinary subfields according to scientific works or interactions.

Self-determined Communities and $(\alpha, \beta)$-clusters In this paper, we also provide several new results for communities defined as $(\alpha, \beta)$-clusters, a notion introduced by Mishra, Schreiber, Stanton, and Tarjan [16] for analyzing (unweighted) social networks. In their definition, $S$ is an $(\alpha, \beta)$-cluster (for $\alpha > \beta$) if for every $i \in S$, the number of neighbors coming from $S$ is at least $\alpha|S|$ and for every $j \notin S$, the number of neighbors coming from $S$ is at most $\beta|S|$. We prove that there exists a family of networks with a superpolynomial number of $(\alpha, \beta)$-clusters. For instance, if $\alpha = 1$ and $\alpha - \beta = 0.01$, then in $G(n, 1/2)$, the Erdős-Rényi random graph with edge probability $1/2$, the expected number of $(\alpha, \beta)$-clusters is $n^{\Omega(\log n)}$. We also show that under the assumption that the planted clique problem is hard, even finding a single $(\alpha, \beta)$-cluster is computationally hard. Interestingly, our notion of communities in social networks obtained via direct lifting is quite similar to the notion of $(\alpha, \beta)$-clusters, with the only difference that we bound the total amount of votes a member may cast.² This twist seems to be essential to obtain only a polynomial number of such communities and to be able to enumerate them in polynomial time.
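For comparison with the voting-based definition, the $(\alpha, \beta)$-cluster condition can be checked directly from adjacency sets. The sketch below is only illustrative: the function name and the dictionary-of-sets graph representation are assumptions, and whether a vertex counts as its own neighbor is left to the caller (add self-loops to `adj` if that convention is wanted).

```python
def is_alpha_beta_cluster(adj, S, alpha, beta):
    """Check the (alpha, beta)-cluster condition for a vertex set S.

    `adj` maps every vertex to the set of its neighbors (hypothetical format).
    """
    S = set(S)

    def neighbors_in_S(v):
        return len(adj[v] & S)

    inside_ok = all(neighbors_in_S(i) >= alpha * len(S) for i in S)
    outside_ok = all(neighbors_in_S(j) <= beta * len(S)
                     for j in adj if j not in S)
    return inside_ok and outside_ok
```

Unlike the self-determined community test, nothing here caps how many vertices a member is adjacent to, which is exactly the difference noted above.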

Related Work Problems of clustering or grouping data (based on network or pairwise similarity information or ranked data) have been extensively studied in many different fields. The classic goals have been to produce either a partition or a hierarchical clustering of the data [10, 6, 4, 8, 5, 13]. With the rise of online social networks, there has been significant recent interest in identifying overlapping clusters, or communities, in networks ranging from professional contact networks to citation networks to product-purchasing networks, with many heuristics and optimization criteria being proposed [14, 15, 19, 16, 17, 12]. However, much of this work has disallowed natural communities such as those containing highly popular nodes [21, 22, 14, 15] or not given general guarantees on the number or computation time needed to find all overlapping communities meeting natural criteria [7, 16, 17, 12]. By contrast, our new formalization leads to natural communities and efficient algorithms for identifying all such communities. Additionally, our model allows us to deal with asymmetries in the input in a very natural way.

Independently, recent work [3] considers several assumptions (that are between worst case and average case) concerning community structure and provides efficient algorithms in these settings. Remarkably, while their setting is somewhat different from ours, some of their algorithms are similar in spirit.

2 Preliminaries and Notation

In our most basic model, we consider an affinity system with $n$ members $V = \{1, \dots, n\}$ and assume that each member $i \in V$ states a strict ranking $\pi_i$ of all members in the order of her preferences. Let $\Pi = \{\pi_1, \dots, \pi_n\}$. For $t > 0$, $S \subseteq V$, $i \in V$ we denote by $v^t_S(i)$ the number of members in $S$ that place $i$ among the topmost $t$ elements of their preference list. That is, $v^t_S(i) = |\{s \in S \mid i \in \pi_s(1 : t)\}|$. For $\theta > 0$, we let $\phi^\theta_S(i) := v^{\lceil \theta|S| \rceil}_S(i)$. We define a natural notion of self-determined community as follows:

²For example, for the direct lifting of $G$, the notion of the community we obtain is as follows: $S$ is a $(\theta, \alpha, \beta)$-self-determined community in $G$ if every $i \in S$ receives at least $\alpha|S|$ collective vote and everyone not in $S$ receives at most $\beta|S|$ collective vote. If $G$ is unweighted, $(i, j) \in E$, $i \in S$, and $d_i$ is the out-degree of $i$, then one way to set up the affinity system is to let $i$ contribute $\min(1, \theta|S|/d_i)$ to the collective vote of $j$.


Definition 1 Given three positive parameters $\theta, \alpha, \beta$, where $\beta < \alpha \le 1$, and an affinity system $(V, \Pi)$, we say that a subset $S$ of $V$ is a $(\theta, \alpha, \beta)$ self-determined community with respect to $(V, \Pi)$ if we have both

(1) For all $i \in S$, $\phi^\theta_S(i) \ge \alpha|S|$.

(2) For all $j \notin S$, $\phi^\theta_S(j) \le \beta|S|$.

Throughout the paper, we will denote $\gamma = \alpha - \beta$. Fixing $\theta$, we say that “$i$ votes for $j$ with respect to a subset $S$” if $j \in \pi_i(1 : \lceil \theta|S| \rceil)$. When $S$ is clear from the context we say that $i$ votes for $j$.

Note that communities may overlap. As a simple example, assume we have two sets $A_1$ and $A_2$ of size $n/2$ with $n/8$ nodes in common (representing, say, researchers in Algorithms and researchers in Complexity). Assume each node in $A_i \setminus A_j$ ranks first the nodes in $A_i$ and then the nodes in $A_j$, and that each node in $A_i \cap A_j$ ranks the nodes in $A_i \cup A_j$ arbitrarily. Then each $A_i$ is a $(1, 3/4, 1/4)$ self-determined community.
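The definition and the overlapping example can be checked mechanically. The following sketch is illustrative only: members are numbered from 0, `rankings[s]` is assumed to be member `s`'s full preference list (most preferred first, possibly including `s` itself), and the helper `two_overlapping_sets` is a hypothetical construction of the example with $n = 16$, where the "arbitrary" rankings are simply taken in increasing order.

```python
from math import ceil

def vote_counts(rankings, S, theta):
    """counts[i] = number of s in S that rank i among their top ceil(theta*|S|)."""
    top = ceil(theta * len(S))
    counts = [0] * len(rankings)
    for s in S:
        for i in rankings[s][:top]:
            counts[i] += 1
    return counts

def is_self_determined(rankings, S, theta, alpha, beta):
    """Check conditions (1) and (2) of the community definition for the set S."""
    S = set(S)
    counts = vote_counts(rankings, S, theta)
    inside_ok = all(counts[i] >= alpha * len(S) for i in S)
    outside_ok = all(counts[j] <= beta * len(S)
                     for j in range(len(rankings)) if j not in S)
    return inside_ok and outside_ok

def two_overlapping_sets(n=16):
    """Two sets of size n/2 sharing n/8 members, ranked as in the example."""
    everyone = list(range(n))
    A1 = list(range(0, n // 2))
    A2 = list(range(n // 2 - n // 8, n - n // 8))

    def ranking(first, second):
        head = list(first) + [x for x in second if x not in first]
        return head + [x for x in everyone if x not in head]

    rankings = []
    for s in everyone:
        if s in A1 and s not in A2:
            rankings.append(ranking(A1, A2))
        elif s in A2 and s not in A1:
            rankings.append(ranking(A2, A1))
        else:
            # Overlap members and outsiders: one arbitrary order (an assumption).
            rankings.append(everyone[:])
    return rankings, A1, A2

rankings, A1, A2 = two_overlapping_sets()
print(is_self_determined(rankings, A1, 1, 0.75, 0.25))
print(is_self_determined(rankings, A2, 1, 0.75, 0.25))
```

With $\theta = 1$ every member of $A_i \setminus A_j$ votes for exactly $A_i$, so each member of $A_i$ collects at least $3/4$ of the votes and each outsider at most the $1/4$ coming from the overlap, whatever order the overlap members choose.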

We also consider (more general) weighted affinity systems, where the preferences of each member $i$ involve numerical weightings (degrees of affinity) rather than just an ordinal ranking. A weighted affinity system is expressed as $A = \{V, a_1, \dots, a_n\}$, where $a_i$ is an $n$-dimensional vector $a_i = (a_{i,1}, \dots, a_{i,n})$ and $0 \le a_{i,j} \le 1$ specifies the degree of affinity that $i$ has for $j$. For example, $i$ may give her top-ranked node a weight of $1$, she might have a tie between her second and third-ranked nodes, giving both a weight of $0.7$, and so on. If member $i$ chooses not to vote for a given node, this can be modeled by giving that node a weight of $0$.

We can naturally extend our notion of $(\theta, \alpha, \beta)$-self-determined communities to weighted affinity systems. In the definition of voting, the only requirement is that members are only allowed to cast votes up to a total weight of $\theta t$ when voting for a community of size $t$. For example, to evaluate whether a subset $S$ is a good community, each member $s \in S$ casts a weighted vote as follows: $s$ determines a prefix of the weights (sorted from highest to lowest) of total value $\theta|S|$ and zeros out the rest. If there are ties at the boundary, a natural conversion is to scale down the weights of those nodes just at the boundary to make the sum exactly equal to $\theta|S|$. In general, we denote the resulting vector (after capping the amount of vote a member casts when voting for a community of size $t$) as $a^{\theta|S|}_s$. The amount of weight that member $i \in V$ receives from $S$ is $a^\theta_S(i) = \sum_{s \in S} a^{\theta|S|}_{s,i}$. Given these, we can define a $(\theta, \alpha, \beta)$ weighted self-determined community as follows:

Definition 2 Given $\theta, \alpha, \beta \ge 0$, $\beta < \alpha \le 1$, and a weighted affinity system $(V, A)$, we say that a subset $S$ of $V$ is a $(\theta, \alpha, \beta)$ weighted self-determined community with respect to $(V, A)$ if we have both

(1) For all $i \in S$, $a^\theta_S(i) \ge \alpha|S|$.

(2) For all $j \notin S$, $a^\theta_S(j) \le \beta|S|$.
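The capping step is the only part of the weighted definition that is not a direct sum, so a small sketch may help. This is one possible reading of the rule described above, not a definitive implementation: the names `capped_votes` and `received_weight` are hypothetical, weights are grouped by exact equality, and the whole group of tied weights that straddles the budget is scaled down uniformly.

```python
from itertools import groupby

def capped_votes(weights, budget):
    """Keep the largest weights until their total reaches `budget`.

    Groups of equal weights are kept whole while they fit; the group that
    would exceed the budget is scaled down so the total equals the budget,
    and every smaller weight is zeroed out.
    """
    capped = [0.0] * len(weights)
    order = sorted(range(len(weights)), key=lambda j: -weights[j])
    remaining = float(budget)
    for w, group in groupby(order, key=lambda j: weights[j]):
        group = list(group)
        total = w * len(group)
        if total <= remaining:
            for j in group:
                capped[j] = w
            remaining -= total
        else:
            scale = remaining / total  # total > remaining >= 0, so total > 0
            for j in group:
                capped[j] = w * scale
            break
    return capped

def received_weight(A, S, theta):
    """totals[i] = sum over s in S of the capped vote that s gives to i."""
    budget = theta * len(S)
    totals = [0.0] * len(A)
    for s in S:
        for i, w in enumerate(capped_votes(A[s], budget)):
            totals[i] += w
    return totals
```

For example, a member with weights (1, 0.7, 0.7, 0.5) and a budget of 2 keeps the weight 1 in full, scales the two tied weights down to roughly 0.5 each, and gives nothing to the last node.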

We note that given a (weighted) affinity system and a set $S$, we can test in time polynomial in $n$ whether $S$ is a $(\theta, \alpha, \beta)$-self-determined community or not. Also, fixing a $(\theta, \alpha, \beta)$-self-determined community $S$, one can easily show that there exists a multiset $U$ of size $k(\gamma) = 2\log(4n)/\gamma^2$ such that the set of elements $i$ voted for by at least an $(\alpha - \gamma/2)$ fraction of $U$ (or in the weighted case, the set of elements $i$ receiving at least $(\alpha - \gamma/2)|U|$ total vote from $U$) is identical to $S$. This then implies a very simple quasi-polynomial procedure for finding all self-determined communities, as well as an $n^{O(\log n/\gamma^2)}$ upper bound on the number of $(\theta, \alpha, \beta)$-self-determined communities. (See Appendix A.1 for details.)


In this paper we present a multi-stage approach for finding an unknown community in an affinity system that provides much better guarantees for interesting settings of the parameters. At a generic level, this algorithm takes as input information $I$ about an unknown community $S$ and outputs a list $L$ of subsets of $V$ s.t. if information $I$ is correct with respect to $S$, then with high probability $L$ contains $S$. This algorithm has two main steps: it first generates a list $L_1$ of sets $S_1$ s.t. at least one of the elements in $L_1$ is a rough approximation to $S$, in the sense that $S_1$ nearly contains $S$ and is not much larger than $S$. In the second step, it runs a purification procedure to generate a list $L$ that contains $S$. (See Algorithm 1.) Both steps have to be done with care by exploiting properties of self-determined communities, and we describe in detail in the following sections ways to implement both steps of this generic scheme. We also discuss how to adapt this scheme for outputting a self-determined community in a local manner, for enumerating all self-determined communities, as well as extensions to multi-facet affinity systems and applications of our analysis to social networks.

Algorithm 1 A generic algorithm for identifying an unknown community $S$

Input: Preference system $(V, \Pi)$, information $I$ about an unknown community $S$.

(1) Use information $I$ to generate a list $L_1$ of sets $S_1$ s.t. at least one of the elements in $L_1$ is a rough approximation to $S$.

(2) Run a purification procedure to generate a list $L$ s.t. at least one of the elements in $L$ is identical to $S$.

(3) Remove from the list $L$ all the sets that are not self-determined communities.

Output: List of self-determined communities $L$.

3 Finding Self-determined Communities

In this section we show how to instantiate the generic Algorithm 1 if the information we are given about the unknown community $S$ is its size and the parameters $\theta$, $\alpha$, and $\beta$. We show that this leads to a polynomial time algorithm in the case where $\theta$, $\alpha$, and $\beta$ are constant. We start with a structural result showing that for any self-determined community $S$ there exists a small number of community members s.t. the union of their votes contains almost all of $S$.

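Lemma 1 Let $S$ be a $(\theta, \alpha, \beta)$-self-determined community. Let $\gamma = \alpha - \beta$, $M = \log(16/\gamma)/\alpha$. There exists a set $U$, $|U| \le M$, s.t. the set $S_1 = \{i \in V \mid \exists s \in U, i \in \pi_s(1 : \theta|S|)\}$ satisfies $|S \setminus S_1| \le (\gamma/16)|S|$.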

Proof: Note that any subset $\tilde{S}$ of $S$ receives a total of at least $\alpha|S||\tilde{S}|$ votes from elements of $S$, which implies that for any such $\tilde{S}$ there exists $i_{\tilde{S}} \in S$ that votes for at least $\alpha|\tilde{S}|$ members of $\tilde{S}$. Given this, we find the desired elements $i_1, \dots, i_M \in S$ greedily one by one. Formally, let $\tilde{S}_1 = S$. Let $i_1 \in S$ be an element that votes for at least $\alpha|\tilde{S}_1|$ elements in $\tilde{S}_1$. Let $\tilde{S}_2$ be the set $\tilde{S}_1$ minus the set of elements voted for by $i_1$. In general, at step $l \ge 2$, there exists $i_l \in S$ that votes for at least an $\alpha$ fraction of $\tilde{S}_l$. Let $\tilde{S}_{l+1}$ be the set $\tilde{S}_l$ minus the set of elements voted for by $i_l$. We clearly have $|\tilde{S}_{l+1}| \le (1 - \alpha)^l |\tilde{S}_1|$, so $|\tilde{S}_{M+1}| \le (\gamma/16)|\tilde{S}_1|$ for $M = \log(16/\gamma)/\alpha$. By construction the set $U = \{i_1, \dots, i_M\} \subseteq S$ satisfies the desired condition.
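The final step of the proof compresses a short calculation. Writing $R_l$ for the set of community members still uncovered before greedy step $l$ (notation introduced here only for this remark), and assuming $\log$ denotes the natural logarithm, it can be spelled out as follows:

```latex
\begin{align*}
|R_{M+1}| &\le (1-\alpha)^{M}\,|R_1|
  && \text{each step removes at least an $\alpha$ fraction} \\
 &\le e^{-\alpha M}\,|S|
  && \text{since } 1-\alpha \le e^{-\alpha} \text{ and } R_1 = S \\
 &= e^{-\log(16/\gamma)}\,|S|
  && \text{with } M = \log(16/\gamma)/\alpha \\
 &= \frac{\gamma}{16}\,|S| .
\end{align*}
```

If $M$ is not an integer, rounding it up only strengthens the bound.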

Given Lemma 1, we can use the following procedure for generating a list that contains a rough approximation to $S$ which covers at least a $1 - \gamma/16$ fraction of $S$ and whose size is at most $\log(16/\gamma)|S|$.


Algorithm 2 Generate rough approximations

Input: Preference system $(V, \Pi)$, information $I$ (parameters $\theta, \alpha, \beta$, size $t$).

• Set $L = \emptyset$, $\gamma = \alpha - \beta$, $k_1(\theta, \alpha, \gamma) = \log(16/\gamma)/\alpha$.

• Exhaustively search over all subsets $U$ of $V$ of size $k_1(\theta, \alpha, \gamma)$; for each set $U$ add to the list $L$ the set $S_1 \subseteq V$ of points voted for by at least one element in $U$ (i.e., $S_1 = \{i \in V \mid \exists s \in U, i \in \pi_s(1 : \theta t)\}$).

Output: List of sets $L$.
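Algorithm 2 translates almost line by line into code. The sketch below is a hedged illustration rather than the authors' implementation: members are numbered from 0, `rankings[s]` is assumed to be member `s`'s preference list, $\log$ is taken to be the natural logarithm, and both $k_1$ and $\theta t$ are rounded up to integers (the text leaves the rounding implicit).

```python
from itertools import combinations
from math import ceil, log

def rough_approximations(rankings, theta, alpha, beta, t):
    """Enumerate all voter sets U of size k1 and collect the union of their votes."""
    gamma = alpha - beta
    k1 = ceil(log(16 / gamma) / alpha)  # rounding up is an assumption
    top = ceil(theta * t)               # each voter's top-(theta * t) prefix
    candidates = []
    for U in combinations(range(len(rankings)), k1):
        S1 = set()
        for s in U:
            S1.update(rankings[s][:top])
        candidates.append(S1)
    return candidates
```

The loop visits every $k_1$-subset of $V$, which is where the $n^{O(\log(1/\gamma)/\alpha)}$ factor in the running time comes from.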

We now describe a lemma that will be useful for analyzing the purification step, suggesting how we convert a rough approximation to $S$ into a list of candidate much-closer approximations to $S$.

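Lemma 2 Fix a $(\theta, \alpha, \beta)$-self-determined community $S$. Let $\gamma = \alpha - \beta$, $t = |S|$, and $S_1 \subseteq V$, $|S_1| = M\theta t$, s.t. $|S \setminus S_1| \le \gamma t/16$. Let $U$ be a set of $k$ points drawn uniformly at random from $\tilde{S} = S \cap S_1$. Let $S_2$ be the subset of points in $S_1$ that are voted for by at least an $\alpha - \gamma/2$ fraction of nodes in $U$, i.e., $S_2 = \{i \in S_1 \mid v^{\theta t}_U(i) \ge (\alpha - \gamma/2)|U|\}$. If $k = 8\log(32\theta M/(\delta\gamma))/\gamma^2$, then with probability $\ge 1 - \delta$, the symmetric difference satisfies $|\Delta(S_2, S)| \le \gamma t/8$.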

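Proof: We start by showing that the points in $\tilde{S}$ are voted for by at least a $\gamma/2$ larger fraction of $\tilde{S}$ than the points in $S_1 \setminus S$. Let $i \in \tilde{S}$. Since $S$ is $(\theta, \alpha, \beta)$-self-determined, at least $\alpha t$ points in $S$ vote for $i$, and since $|S \setminus \tilde{S}| \le \gamma t/16$ we get that at least $(\alpha - \gamma/16)t$ points in $\tilde{S}$ vote for $i$. Since $|\tilde{S}| \le t$, we obtain that at least an $\alpha - \gamma/16$ fraction of points in $\tilde{S}$ vote for $i$. Let $j$ be a point in $S_1 \setminus S$. We know that at most $\beta t$ points in $S$ vote for $j$, and since $|\tilde{S}| \ge (1 - \gamma/16)t$, we have that at most an $\alpha - 3\gamma/4$ fraction of points in $\tilde{S}$ vote for $j$.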

Fix $i \in S_1$. By Hoeffding’s inequality, since $U$ is a set of $8\log(32\theta M/(\delta\gamma))/\gamma^2$ points drawn uniformly at random from $\tilde{S}$, we have that with probability at least $1 - \gamma\delta/(16\theta M)$ the fraction of points in $\tilde{S}$ that vote for $i$ is within $\gamma/4$ of the fraction of points in $U$ that vote for $i$. These together with the above observations imply that the expected size of $|\Delta(S_2, \tilde{S})|$ is at most $(\gamma\delta/(16\theta M)) \cdot \theta M t = \gamma\delta t/16$. By Markov’s inequality we obtain that there is at most a $\delta$ chance that $|\Delta(S_2, \tilde{S})| \ge \gamma t/16$. Using the fact that $|S \setminus \tilde{S}| \le \gamma t/16$ we finally get that with probability $1 - \delta$ we have $|\Delta(S_2, S)| \le \gamma t/8$.
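The failure probability quoted for a single point follows from a direct substitution into Hoeffding's inequality. In the sketch below, $p_i$ denotes the fraction of the sampled population that votes for $i$ and $\hat{p}_i$ the corresponding fraction within the sample $U$; both symbols are introduced here for illustration, and $\log$ is assumed to be the natural logarithm.

```latex
\begin{align*}
\Pr\!\left[\,|\hat{p}_i - p_i| \ge \tfrac{\gamma}{4}\,\right]
  &\le 2\exp\!\left(-2k\left(\tfrac{\gamma}{4}\right)^{2}\right)
   = 2\exp\!\left(-\tfrac{k\gamma^{2}}{8}\right) \\
  &= 2\exp\!\left(-\log\frac{32\,\theta M}{\delta\gamma}\right)
   && \text{for } k = \frac{8}{\gamma^{2}}\log\frac{32\,\theta M}{\delta\gamma} \\
  &= \frac{2\,\delta\gamma}{32\,\theta M}
   = \frac{\gamma\delta}{16\,\theta M}.
\end{align*}
```

Summing this bound over the $M\theta t$ points of $S_1$ gives the expected number of misclassified points, $\gamma\delta t/16$, used in the Markov step.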

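Algorithm 3 Purification procedure

Input: Preference system $(V, \Pi)$, information $I$ (parameters $\theta, \alpha, \beta, \gamma$, $k_2(\theta, \alpha, \gamma)$, $N_2(\theta, \alpha, \gamma)$, size $t$), list of rough approximations $L_1$.

• For each element $S_1 \in L_1$, repeat $N_2(\theta, \alpha, \gamma)$ times:

  • Sample a set $U_2$ of $k_2(\theta, \alpha, \gamma)$ points at random from $S_1$. Let $S_2 = \{i \in S_1 \mid v^{\theta t}_{U_2}(i) \ge (\alpha - \gamma/2)|U_2|\}$.

  • Let $S_3 = \{i \in V \mid v^{\theta t}_{S_2}(i) \ge (\alpha - \gamma/2)|S_2|\}$. Add $S_3$ to the list $L$.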
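The purification loop can be sketched as follows. This is an illustrative reading of Algorithm 3 under stated assumptions, not the authors' code: sampling is done with replacement using a seeded generator, $\theta t$ is rounded up, rounds in which $S_2$ comes out empty are skipped, and the parameters `k2` and `n2` are passed in directly rather than computed from the formulas in the analysis.

```python
import random
from math import ceil

def purify(rankings, rough_list, theta, alpha, beta, t, k2, n2, seed=0):
    """For each rough set S1, repeatedly sample voters and re-threshold twice."""
    rng = random.Random(seed)
    threshold = alpha - (alpha - beta) / 2  # alpha - gamma / 2
    top = ceil(theta * t)
    top_sets = [set(r[:top]) for r in rankings]

    def votes(voters, i):
        # Number of (possibly repeated) voters that rank i in their top prefix.
        return sum(1 for s in voters if i in top_sets[s])

    candidates = []
    for S1 in rough_list:
        pool = sorted(S1)
        if not pool:
            continue
        for _ in range(n2):
            U2 = [rng.choice(pool) for _ in range(k2)]
            S2 = [i for i in pool if votes(U2, i) >= threshold * len(U2)]
            if not S2:
                continue  # nothing passed the first threshold (assumed skip)
            S3 = {i for i in range(len(rankings))
                  if votes(S2, i) >= threshold * len(S2)}
            candidates.append(S3)
    return candidates
```

The returned candidates still have to be filtered by the community test, as in step (3) of the generic scheme.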


We now show how Lemmas 1 and 2 can be used to identify and enumerate communities.

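Theorem 1 Fix a $(\theta, \alpha, \beta)$-self-determined community $S$. Let $\gamma = \alpha - \beta$, $k_1(\theta, \alpha, \gamma) = \log(16/\gamma)/\alpha$, $k_2(\theta, \alpha, \gamma) = \frac{8}{\gamma^2}\log\left(\frac{32\theta k_1}{\gamma\delta}\right)$, $N_2(\theta, \alpha, \gamma) = O((\theta k_1)^{k_2}\log(1/\delta))$. Using Algorithm 2 together with Algorithm 3 for steps (1) and (2) of Algorithm 1, we have that with probability $\ge 1 - \delta$ one of the elements in the list $L$ we output is identical to $S$.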

Proof: Since when running Algorithm 2 we search over all subsets $U$ of $V$ of size $k_1(\theta, \alpha, \gamma)$, by Lemma 1 in one of the rounds we find a set $U$ s.t. the set of points $S_1$ that are voted for by at least one element in $U$ covers a $1 - \gamma/16$ fraction of $S$. So, $L_1$ contains a rough approximation to $S$.

Since $|S| = t$, $U_2$ is a set of $k_2$ elements drawn at random from $\tilde{S} = S \cap S_1$ with probability $\ge (t/(2t\theta k_1))^{k_2}$. Therefore for $N_2 = O((2\theta k_1)^{k_2}\log(1/\delta))$, with probability $\ge 1 - \delta/2$ in one of the rounds the set $U_2$ is a set of $k_2$ elements drawn at random from $\tilde{S}$. In such a round, by Lemma 2, with probability $\ge 1 - \delta/2$ we get a set $S_2$ such that $|\Delta(S_2, S)| \le \gamma t/8$. A simple calculation shows that $S_3 = S$.

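Corollary 1 The number of $(\theta, \alpha, \beta)$-self-determined communities in an affinity system $(V, \Pi)$ satisfies

$$B(n) = n^{O(\log(1/\gamma)/\alpha)} \left(\frac{\theta\log(1/\gamma)}{\alpha}\right)^{O\left(\frac{1}{\gamma^2}\log\left(\frac{\theta\log(1/\gamma)}{\alpha\gamma}\right)\right)}$$

and with probability $\ge 1 - 1/n$ we can find all of them in time $B(n)\,\mathrm{poly}(n)$.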

We note that Theorem 1 and Corollary 1 apply even if some nodes do not list all members of $V$ in their preference lists, and then some nodes in a community $S$ have fewer than $\theta|S|$ votes in total. If $\theta$, $\alpha$, and $\beta$ are constant, then Corollary 1 shows that the number of communities is $O\left(n^{\log(1/\gamma)/\alpha}\right)$, which is polynomial in $n$, and they can be found in polynomial time. We can show that the dependence on $n^{1/\alpha}$ is necessary:

Theorem 2 For any constant $\theta \ge 1$, for any $\alpha \ge 2\sqrt{\theta}/n^{1/4}$, there exists an instance such that the number of $(\theta, \alpha, \beta)$-self-determined communities with $\alpha - \beta = \gamma = \alpha/2$ is $n^{\Omega(1/\alpha)}$.

PROOF SKETCH: Consider $L = \sqrt{n}$ blobs $B_1, \dots, B_L$, each of size $\sqrt{n}$. Assume that each point ranks the points inside its blob first (in an arbitrary order) and then ranks the points outside its blob randomly. One can show that with non-zero probability, for $l \le n^{1/4}/(2\sqrt{\theta})$, any union of $l$ blobs satisfies the $(\theta, \alpha, \beta)$-self-stability property with parameters $\alpha = 1/l$ and $\gamma = \alpha/2$. Full details appear in Appendix A.1.
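To see why this construction yields $n^{\Omega(1/\alpha)}$ communities, it is enough to count the unions of $l$ blobs. The following is a sketch of that count, assuming $l \le n^{1/4}$ (which is implied by $l \le n^{1/4}/(2\sqrt{\theta})$ when $\theta \ge 1$) and using the standard bound $\binom{m}{l} \ge (m/l)^l$:

```latex
\begin{align*}
\#\{\text{unions of } l \text{ blobs}\}
  = \binom{\sqrt{n}}{l}
  \ge \left(\frac{\sqrt{n}}{l}\right)^{l}
  \ge \left(n^{1/4}\right)^{l}
  = n^{l/4}
  = n^{1/(4\alpha)}
  \qquad \text{for } \alpha = 1/l .
\end{align*}
```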

3.1 Self-determined Communities in Weighted Affinity Systems

We provide here a simple efficient reduction from the weighted case to the non-weighted case.

Theorem 3 Given a weighted affinity system $(V, A)$, $\theta, \alpha, \beta$, $\epsilon < \alpha$, and a community size $t$, there is an efficient procedure that constructs a non-weighted instance $(V', \Pi)$ along with a mapping $f$ from $V'$ to $V$, s.t. for any $(\theta, \alpha, \beta)$ community $S$ in $V$ there exists a $(\theta, \alpha - \epsilon, \beta)$ community $S'$ in $(V', \Pi)$ with $f(S') = S$.

Proof: Given the original weighted instance $(V, A)$, we construct a non-weighted instance $(V', \Pi)$ as follows. For each $s \in V$, we create a blob $B_s$ of $k$ nodes in $V'$. For any $s, \tilde{s} \in V$, if $p$ is the weight $a^{\theta t}_{s,\tilde{s}}$ with which $s$ votes for $\tilde{s}$, we connect $B_s$ to $B_{\tilde{s}}$ with $G_{k,k,\lfloor pk \rfloor}$, where $G_{k,k,\lfloor pk \rfloor}$ is a bipartite graph with $k$ nodes on the left and $k$ nodes on the right such that each node on the left has out-degree $\lfloor pk \rfloor$ and each node on the right has in-degree $\lfloor pk \rfloor$. Clearly all nodes in $V'$ rank at most $k|S|\theta$ other nodes (and do not have an opinion about the rest). Let $k = 1/\epsilon$. Consider a community $S$ in $(V, A)$. For any $s \in S$ and for each node $i \in B_s$, the total vote from nodes in $B_{\tilde{s}}$ for $\tilde{s} \in S$ (when evaluating whether $\cup_{s \in S} B_s$ is a good community or not) is at least $\alpha|S|k - |S| \ge k|S|(\alpha - \epsilon)$. Moreover, for $s \notin S$ and for each node in $B_s$, the total vote from the nodes in $B_{\tilde{s}}$ for $\tilde{s} \in S$ is at most $\beta|S|k$. Therefore $\cup_{s \in S} B_s$ is a legal $(\theta, \alpha - \epsilon, \beta)$-self-determined community of size $kt$ in the non-weighted instance $(V', \Pi)$.
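The proof only needs some bipartite graph with $k$ nodes per side and all left out-degrees and right in-degrees equal to a given $d = \lfloor pk \rfloor$; it does not say how to build one. One standard choice, offered here as an assumption rather than the authors' construction, is a circulant pattern in which left node $u$ points to right nodes $u, u+1, \dots, u+d-1$ modulo $k$. The function name is hypothetical.

```python
def regular_bipartite(k, d):
    """Edges (u, v) of a bipartite graph on k left and k right nodes.

    Left node u is joined to right nodes (u + r) mod k for r = 0..d-1, so
    every left node has out-degree d and every right node has in-degree d.
    """
    if not 0 <= d <= k:
        raise ValueError("need 0 <= d <= k")
    return [(u, (u + r) % k) for u in range(k) for r in range(d)]
```

Because the offsets $0, \dots, d-1$ are distinct modulo $k$, no edge is repeated, and each right node $v$ is reached exactly once for each offset.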

Using this reduction we immediately get the following result:

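Theorem 4 For any $\theta, \alpha, \beta$, $\gamma = \alpha - \beta$, the number of weighted $(\theta, \alpha, \beta)$-self-determined communities is

$$B(n) = (n/\gamma)^{O(\log(1/\gamma)/\alpha)} \left(\frac{2\theta\log(1/\gamma)}{\alpha}\right)^{O\left(\frac{1}{\gamma^2}\log\left(\frac{\theta\log(1/\gamma)}{\alpha\gamma}\right)\right)}$$

and we can find them in time $B(n)\,\mathrm{poly}(n)$.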

Proof: We perform the reduction in Theorem 3 with ε = γ/2 and use the algorithm in Theorem 1 and the bound in Corollary 1. The proof follows from the fact that the number of vertices in the new instance has increased by only a factor of 2/γ. We also note that each set output on the reduced instance can then be examined on the original weighted affinity system, and kept iff it satisfies the community definition with the original parameters.

3.2 Self-determined Communities in Multi-faceted Affinity Systems

A multi-faceted affinity system is a system where each node may have more than one ranking of the other nodes. Suppose that each element i is allowed to have at most f different rankings $\pi_i^1, \ldots, \pi_i^f$. We say that the pair (S, ψ) is a multi-faceted community, where ψ : S → {1, . . . , f}, if S is a community where ψ(i) specifies which ranking facet should be used by element i. In other words, as before, let

\[ \phi^{\theta}_{S,\psi}(i) := \left| \{ s \in S : i \in \pi_s^{\psi(s)}(1 : \lceil \theta |S| \rceil) \} \right|. \]

Then (S, ψ) is an (α, β, θ)-multifaceted community if for all i ∈ S, $\phi^{\theta}_{S,\psi}(i) \ge \alpha|S|$, and for all j ∉ S, $\phi^{\theta}_{S,\psi}(j) < \beta|S|$.

We show that for a bounded f, even though there may be exponentially many functions ψ, it is not harder to find multifaceted communities than to find regular communities. Note that all our sampling algorithms can be adapted as follows. Once a representative sample $i_1, \ldots, i_k$ of the community S is obtained, we can guess the facets $\psi(i_1), \ldots, \psi(i_k)$ while adding a multiplicative $f^k$ factor to the running time. We can thus get the set $S_2$ approximating S in the same way as it is found in Algorithms 2 and 3 while adding a multiplicative factor of $f^{k_1+k_2}$ to the running time. We thus obtain a list L that for each multi-faceted community (S, ψ) contains a set $S_2$ such that $\Delta(S_2, S) < \gamma t/8$. Given $S_2$ we can output S with probability $> f^{-8\log n/\gamma^2}/2$: guess a set $U_2$ of $m = 8\log n/\gamma^2$ points in $S_2$; guess a function $\psi_2$ on $U_2$; output S = the set of points that receive at least (α − γ/2)t votes according to $(U_2, \psi_2)$. Moreover, a facet structure ψ′ can be recovered on S so that (S, ψ′) is an (α − γ/4, β + γ/4, θ)-multifaceted community using a combination of linear programming and sampling. Details appear in Appendix A.3.
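To make the definition of a multifaceted community concrete, the following minimal sketch evaluates the vote count φ and the two community conditions directly. The data layout is an assumption made for illustration: `rankings[s]` is a list of facets, each facet a list of elements in decreasing order of preference (possibly including s itself), and `psi[s]` is a 0-based facet index. The function names are hypothetical.

```python
from math import ceil

def phi(i, S, psi, rankings, theta):
    """Number of s in S whose chosen facet ranks i among its top ceil(theta*|S|)."""
    top = ceil(theta * len(S))
    return sum(1 for s in S if i in rankings[s][psi[s]][:top])

def is_multifaceted_community(S, psi, rankings, theta, alpha, beta):
    """Check phi(i) >= alpha*|S| inside S and phi(j) < beta*|S| outside S."""
    n_s = len(S)
    inside_ok = all(phi(i, S, psi, rankings, theta) >= alpha * n_s for i in S)
    outside_ok = all(phi(j, S, psi, rankings, theta) < beta * n_s
                     for j in rankings if j not in S)
    return inside_ok and outside_ok
```

With f = 1 (a single facet per element and ψ identically 0) this reduces to a check of the ordinary, non-faceted community conditions in the same form.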

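Theorem 5 Let S be an f-faceted (α, β, θ)-community. Then there is an algorithm that runs in $O(n^2)$ time and outputs S, as well as a facet structure ψ′ on S such that (S, ψ′) is an (α − γ/4, β + γ/4, θ)-multifaceted community, with probability

\[ \ge (f \cdot n)^{-O(\log(1/\gamma)/\alpha)} \left( \frac{f \cdot \theta \log(1/\gamma)}{\alpha} \right)^{-O\left( \frac{1}{\gamma^2} \log\left( \frac{\theta \log(1/\gamma)}{\alpha\gamma} \right) \right)} f^{-O(\log n/\gamma^2)}. \]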

4 A Local Algorithm for Finding Self-determined Communities

In this section we describe a local algorithm for finding a community. Given a single element v and the target community size t, the goal of the algorithm is to output a community S of size t containing v. Let us fix a target community S that we are trying to uncover this way.

We note that we need α > 1/2 for a local algorithm that uses only one seed to succeed. If α ≤ 1/2 then one may have a valid (θ, α, β)-community that is comprised of two disjoint cliques of vertices. In this case, no local algorithm that starts with just one vertex as a seed may uncover both cliques; however, we can extend the construction below if we start with O(1/α) seeds. Below, we focus on providing a local algorithm for α > 1/2. Our local algorithm will follow the structure of the generic Algorithm 1. The main technical challenge is to provide a local procedure for producing rough approximations. In general, it is not possible to do so starting from any seed vertex v ∈ S. For example, if v is a super-popular vertex that is voted first by everyone in V, then v will belong to all communities including S, but v would contain no “special information” that would allow one to identify S. However, we will show that a constant fraction of the nodes in S are sufficiently “representative” of S to enable one to recover S.

Let us fix t and θ. For an element v, we let R(v) be a uniformly random element which receives v’s vote with these parameters. In other words, R(v) := uniform element of $\pi_v(1 : \theta \cdot t)$. We start with the main technical claim that enables a local procedure for producing rough approximations.

Lemma 3 Let S be any (θ, α, β)-community of size t. Let η := 2α − 1 > 0. Then there is a subset T ⊆ S such that |T| ≥ ηt and for each pair v ∈ T and u ∈ S, we have

\[ \Pr[R(R(v)) = u] \ge \frac{(\alpha - 1/2)/\theta^2}{t}. \]

Proof: For each element v ∈ S denote by $O_S(v) := \pi_v(1 : \theta \cdot t) \cap S$ the elements of S that v votes for, and by $I_S(v) := \{u \in S : v \in \pi_u(1 : \theta \cdot t)\}$ the elements of S that vote for v. By the community property we know that $|I_S(v)| \ge \alpha t$ for all v ∈ S. Observe that

\[ \sum_{v \in S} |O_S(v)| = \sum_{v \in S} |I_S(v)| \ge \alpha t^2. \]

Hence at least an η-fraction of the v’s in S must satisfy $|O_S(v)| \ge t/2$, where η = 2α − 1. Let $T := \{v : |O_S(v)| \ge t/2\} \subseteq S$. For any v ∈ T and any u ∈ S, we have

\[ |O_S(v) \cap I_S(u)| \ge |O_S(v)| + |I_S(u)| - t \ge (\alpha - 1/2) \cdot t. \]

To finish the proof note that

\[ \Pr[R(R(v)) = u] \ge \Pr[R(v) \in O_S(v) \cap I_S(u)] \cdot \frac{1}{\theta \cdot t} \ge \frac{(\alpha - 1/2) \cdot t}{\theta \cdot t} \cdot \frac{1}{\theta \cdot t} = \frac{(\alpha - 1/2)/\theta^2}{t}. \]

We call any vertex v in the set T in Lemma 3 a “good seed vertex” for S. Lemma 3 suggests a natural procedure (Algorithm 4) for generating a rough approximation in a local way given a good seed vertex.

Algorithm 4 Generate rough approximations

Input: Preference system (V, Π), information I (parameters θ, α, β, γ, vertex v, size t).

• Set $S_1 = \left\{ u : \Pr[u = R(R(v))] \ge \frac{(\alpha - 1/2)/\theta^2}{t} \right\}$.

Output: List of sets $L = \{S_1\}$.
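Algorithm 4 can be illustrated by computing the distribution of R(R(v)) exactly rather than by sampling. The sketch below is a minimal illustration under stated assumptions: `rankings[x]` is a list giving x's ranking (which may include x itself), θ·t is an integer, and the function names are hypothetical. Exact rational arithmetic is used so that the threshold comparison is not affected by rounding.

```python
from collections import defaultdict
from fractions import Fraction

def two_step_distribution(v, rankings, theta, t):
    """Exact distribution of R(R(v)); R(x) is uniform over the top theta*t of x."""
    top = int(theta * t)  # assumes theta * t is an integer
    dist = defaultdict(Fraction)
    for w in rankings[v][:top]:          # w ranges over the possible values of R(v)
        for u in rankings[w][:top]:      # u ranges over the possible values of R(w)
            dist[u] += Fraction(1, top * top)
    return dist

def rough_approximation(v, rankings, theta, alpha, t):
    """The set S_1 of Algorithm 4: elements hit with probability >= (alpha-1/2)/(theta^2 t)."""
    threshold = (Fraction(alpha) - Fraction(1, 2)) / (Fraction(theta) ** 2 * t)
    dist = two_step_distribution(v, rankings, theta, t)
    return {u for u, p in dist.items() if p >= threshold}
```

The exact computation touches at most (θt)² ranking entries, which is already local in the sense that it does not depend on n; the sampling implementation analysed in Theorem 7 trades exactness for a running time of O(t log t).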


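Theorem 6 Assume α > 1/2. Let

\[ k_2(\theta, \alpha, \gamma) = O\left( \frac{\log(\theta/\delta\gamma(\alpha - 1/2))}{\gamma^2} \right), \qquad N_2(\theta, \alpha, \gamma) = \left( \frac{\theta^2}{\alpha - 1/2} \right)^{k_2(\theta,\alpha,\gamma)} \log(1/\delta). \]

Assuming v is a good seed element for a community S, then by using Algorithm 4 together with Algorithm 3 for steps (1) and (2) of Algorithm 1, we have that with probability ≥ 1 − δ we will output S.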

Proof: It is enough to show that each iteration of the purification algorithm (Algorithm 3) has probability $\ge \left( \frac{\alpha - 1/2}{\theta^2} \right)^{k_2}$ to output S. Since v is a good seed element of S, the set $S_1$ produced by Algorithm 4 must contain S. It is easy to see that $|S_1| \le t\theta^2/(\alpha - 1/2)$. Thus, applying Lemma 2 with M = θ/(α − 1/2) we see that if the points of $U_2$ are drawn uniformly from S, then with high probability $S_2$ is γ/8-close to S, and $S_3 = S$. Since conditioned on $U_2 \subseteq S$, $U_2$ is uniform in S, our probability of success is given by the probability that $U_2 \subseteq S$, which is equal to

\[ \left( \frac{|S|}{|S_1|} \right)^{k_2} \ge \left( \frac{\alpha - 1/2}{\theta^2} \right)^{k_2}, \]

which completes the proof.

Note that when α > 1/2, β, and θ are constants, the purification procedure will run in a constant number of iterations. Our main result of this section is the following:

Theorem 7 Suppose α > 1/2. Assume α, β, θ, and δ are constants. If v is chosen uniformly at random from S, then with probability at least (2α − 1)(1 − δ) we can find S in time O(t log t).

Proof: First, by Lemma 3, with probability at least 2α − 1, element v is such that for all u ∈ S, we have $\Pr[R(R(v)) = u] \ge \frac{\alpha - 1/2}{\theta^2 t}$. We now implement Algorithm 4 by performing

\[ \left( \frac{8\theta^2 t}{\alpha - 1/2} \right) \log(2t/\delta) \]

random draws from R(R(v)) and letting $S_1$ be the set of points u hit at least 4 log(2t/δ) times. By Chernoff bounds, for each u ∈ S, we have included u in $S_1$ with probability at least $1 - e^{-8\log(2t/\delta)/8} = 1 - \delta/(2t)$, so with probability at least 1 − δ/2 we have $S_1 \supseteq S$. Furthermore, since we only include points hit at least 4 log(2t/δ) times, we have $|S_1| \le \frac{2\theta^2 t}{\alpha - 1/2}$. Thus, the analysis in Theorem 6 implies that the purification step (Algorithm 3) will succeed with probability at least 1 − δ/2 for a choice of

\[ N_2 = \left( \frac{2\theta^2}{\alpha - 1/2} \right)^{k_2(\theta,\alpha,\gamma)} \log(2/\delta). \]

Putting these together yields the desired success probability. Furthermore, since α, β, θ, δ are constants, the overall time is O(t log t).
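The sampling implementation of Algorithm 4 described in the proof above can be sketched as follows. This is a minimal sketch under assumptions: `rankings[x]` is a list giving x's ranking, θ·t is an integer, the logarithm is taken to be natural, and the function name and the explicit `rng` argument are illustrative choices rather than anything specified in the paper.

```python
import math
import random
from collections import Counter

def sample_rough_approximation(v, rankings, theta, alpha, t, delta, rng):
    """Estimate S_1 from (8 theta^2 t / (alpha - 1/2)) * log(2t/delta) draws of R(R(v))."""
    top = int(theta * t)  # assumes theta * t is an integer
    log_term = math.log(2 * t / delta)
    draws = math.ceil(8 * theta**2 * t / (alpha - 0.5) * log_term)
    hits = Counter()
    for _ in range(draws):
        w = rng.choice(rankings[v][:top])   # w = R(v)
        u = rng.choice(rankings[w][:top])   # u = R(R(v))
        hits[u] += 1
    # Keep the points hit at least 4 log(2t/delta) times.
    return {u for u, c in hits.items() if c >= 4 * log_term}
```

A call such as `sample_rough_approximation(v, rankings, 1, 0.75, t, 0.1, random.Random(0))` performs O(t log t) draws when θ, α, and δ are constants, matching the running time claimed in Theorem 7.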

It is not hard to see that the algorithm in Theorem 6 will work even if t is given to it only up to some small multiplicative error. As a corollary of Theorem 6, we see that the number of communities is actually linear and we can find all of them in quasilinear time.

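Theorem 8 Suppose that α > 1/2. The total number of (θ, α, β)-self-determined communities is bounded by

\[ O\left( n \cdot \frac{1}{\min(\gamma, \alpha - 1/2)} \cdot \left( \frac{\theta^2}{\alpha - 1/2} \right)^{O\left( \frac{\log(\theta/\delta\gamma(\alpha - 1/2))}{\gamma^2} \right)} \right), \]

which is O(n) if α, β, and θ are constants.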

Proof: It is easy to see that executing the algorithm in Theorem 7, where we only do one iteration of the purification step (i.e., of Algorithm 3), with inputs t′ ∈ ((1 − ε)t, (1 + ε)t), α′ = α − 4ε, β′ = β + 4ε, θ′ = θ(1 + ε), and an appropriate seed vertex v ∈ S will lead to a discovery of a (θ, α, β)-community of size |S| = t with probability $\ge p := \left( \frac{\alpha - 1/2}{\theta^2} \right)^{k_2(\theta,\alpha,\gamma)}$, as long as ε is sufficiently small. Here it is enough to take ε = min(γ, α − 1/2)/100. Thus a pair (v, t′), where v is a vertex and t′ is the target size, corresponds to at most 1/p distinct communities. Moreover, each community S of size t corresponds to more than t(2α − 1)/2 such pairs. Since t′ needs only to be within a multiplicative (1 + ε) from t, we can always select t′ from the set of values $\{(1+\varepsilon)^i : i = 0, 1, \ldots, \lceil \log_{1+\varepsilon} n \rceil\}$. For each value t′, the number of communities of size between t′ and t′(1 + ε) is thus bounded by the number of possible pairs (t′, v) (= n), times 1/p and divided by t′(2α − 1)/2:

\[ \#\,\text{communities of size between } t' \text{ and } t'(1+\varepsilon) \le \frac{n}{t'} \cdot \frac{1/p}{(2\alpha - 1)/2}. \]

Summing over the possible values of t′ we obtain the upper bound

\[ n \cdot \frac{2}{\varepsilon(2\alpha - 1)} \cdot \left( \frac{\theta^2}{\alpha - 1/2} \right)^{k_2(\theta,\alpha,\gamma)}, \]

which leads to the bound in the statement of the theorem.

Note: We can extend our local approach to weighted and multi-faceted affinity systems. See Appendix A.4.

4.1 An Alternative Non-local Algorithm

The analysis in this section suggests an alternative way for generating rough approximations in the non-local model, which leads to an algorithm that provides asymptotically better bounds than Theorem 1 in interesting cases, in particular when θ, α, and γ are constants and there is a large gap between α and γ. This leads to an improved polynomial bound of $n^{O(\log(1/\alpha)/\alpha)}$ on the number of (θ, α, β)-self-determined communities when θ, α, and γ are constants, using Algorithm 5:

Algorithm 5 Generate rough approximations

Input: Preference system (V, Π), information I (parameters θ, α, β, size t).

• Set L = ∅; γ = α − β.

• Exhaustively search over all subsets $U_0$ of V of size $\lceil (\log 1/\alpha)/\alpha \rceil + 1$; for each $U_0$ add to the list L the set

\[ S_1 := \left\{ x : \sum_{y \in U_0} \Pr[x = R(R(y))] \ge \frac{\alpha}{2\theta^2 t} \right\}. \]


Theorem 9 Fix a (θ, α, β)-self-determined community S. Let γ = α − β, $k_1(\theta, \alpha, \gamma) = O(\log(1/\alpha)/\alpha)$, $k_2(\theta, \alpha, \gamma) = O\left( \frac{1}{\gamma^2} \log\left( \frac{\theta k_1}{\gamma\delta} \right) \right)$, and $N_2(\theta, \alpha, \gamma) = O((\theta^2/\alpha^3)^{k_2} \log(1/\delta))$. Using Algorithm 5 together with Algorithm 3 for steps (1) and (2) of Algorithm 1, with probability ≥ 1 − δ one of the elements in the list L we output is identical to S.

PROOF SKETCH: By using a reasoning similar to the one in Lemma 2 we can show that there exists a set $U_0$ of $\lceil (\log 1/\alpha)/\alpha \rceil + 1$ points such that the subset $U_1$ of points voted for by at least one member of $U_0$ contains a ≥ 1 − α/2 fraction of S. We show in the following that the corresponding set $S_1$ indeed covers S. Fix a vertex x ∈ S. We need to show that

\[ \sum_{y \in U_0} \Pr[x = R(R(y))] \ge \frac{\alpha}{2\theta^2 t}. \]

Let Q ⊆ S be the set of elements that vote for x. We know that |Q| ≥ αt, since x ∈ S. Thus

\[ |U_1 \cap Q| \ge |U_1| + |Q| - |S| > \alpha t/2. \]


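Each z ∈ $U_1 \cap Q$ contributes at least $1/(\theta^2 t^2)$ to the sum $\sum_{y \in U_0} \Pr[x = R(R(y))]$. Thus this sum is at least $(\alpha t/2) \cdot (1/(\theta^2 t^2)) = \alpha/(2\theta^2 t)$. Hence x ∈ $S_1$, as required. Moreover, by observing that

\[ \sum_{x} \sum_{y \in U_0} \Pr[x = R(R(y))] = \sum_{y \in U_0} \sum_{x} \Pr[x = R(R(y))] < 1/\alpha^2, \]

we obtain $|S_1| < \frac{2\theta^2 t}{\alpha^3}$.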

Since when running Algorithm 5 we exhaustively search over all subsets $U_0$ of V of size $k_1(\theta, \alpha, \gamma)$, in one of the rounds we find a set $U_0$ such that $|S_1| < \frac{2\theta^2 t}{\alpha^3}$ and $S \subseteq S_1$. So, the list L contains a rough approximation to S. Finally, using a reasoning similar to the one in Theorem 1 we get the desired conclusion.

Theorem 9 gives asymptotically better bounds than Theorem 1 when $N_1 = n^{k_1(\theta,\alpha,\gamma)}$ is the dominant term in the bound (e.g., when θ, α, and γ are constants) and especially when there is a large gap between α and γ, since $k_1$ is reduced from log(16/γ)/α to $\lceil \log(1/\alpha)/\alpha \rceil + 1$. On the other hand, Theorem 9 has worse dependence on θ and α in $N_2$, so for certain parameter settings, Theorem 1 can be preferable, especially if one optimizes the constants in Lemmas 1 and 2 based on the given parameters.

5 Self-determined Communities in Social Networks

In this section we present a natural notion of self-determined communities in social networks and discuss how our analysis sheds light on the notion of (α, β)-clusters [16, 17, 12]. We assume that the input is a directed graph G = (V, E) and for a vertex i we denote by $d_i$ its out-degree. As discussed in Section 1, given a social network we can consider the affinity system induced by direct lifting and then consider self-determined communities in that affinity system. This leads to the following very natural notion:

Definition 3 Let G = (V, E) be a directed graph and let θ, α, β ≥ 0 with β < α ≤ 1. Consider the affinity system $(V, a_1, \ldots, a_n)$ where $a_{i,j} = w_{i,j}$ if (i, j) ∈ E and $a_{i,j} = 0$ otherwise. A subset S ⊆ V is a (θ, α, β) self-determined community in G if it is a (θ, α, β) weighted self-determined community in $(V, a_1, \ldots, a_n)$.

Note that when evaluating a community of size t each node i is allowed a total vote of at most θt. One natural way to achieve this is to only fractionally count edges from high-degree nodes i, giving them weight $\min(\theta t/d_i, 1)$ when evaluating a community of size t in the induced weighted affinity system.

The community notion introduced in [16, 17] is as follows:

Definition 4 Let α, β with β < α ≤ 1 be two positive parameters. Given an undirected graph G = (V, E), where every vertex has a self-loop, a subset S ⊆ V is an (α, β)-cluster if S is:

(1) Internally Dense: ∀i ∈ S, |E(i, S)| ≥ α|S|.

(2) Externally Sparse: ∀i ∉ S, |E(i, S)| ≤ β|S|.
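The two conditions of Definition 4 translate directly into a checker. The sketch below is illustrative only: it assumes the graph is given as a dictionary `adj` mapping each vertex to its set of neighbours, with each vertex included in its own neighbour set to model the self-loop, and the function name is hypothetical.

```python
def is_alpha_beta_cluster(adj, S, alpha, beta):
    """adj maps each vertex to its neighbour set, including itself (self-loop)."""
    S = set(S)
    size = len(S)
    # Internally dense: every member has at least alpha*|S| neighbours in S.
    dense = all(len(adj[i] & S) >= alpha * size for i in S)
    # Externally sparse: every non-member has at most beta*|S| neighbours in S.
    sparse = all(len(adj[i] & S) <= beta * size for i in adj if i not in S)
    return dense and sparse
```

Note that with α = 1 the first condition forces S to be a clique, which is the regime used in the hardness results later in this section.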

The (α, β)-cluster notion resembles our community notion in Definition 3. In particular, in the case where the graph is undirected, Definition 3 is similar to Definition 4, except that in the case of our Definition 3 each node i is allowed a total vote of at most θt. As discussed above, one way to achieve this is to only fractionally count edges from high-degree nodes i, giving them weight $\min(\theta|S|/d_i, 1)$. This distinction is crucial for getting polynomial time algorithms. From our results in the previous sections we have that every graph has only a polynomial number of communities satisfying Definition 3 and moreover, we can find all of them in polynomial time. In contrast, as we show, there exist graphs with a superpolynomial number of (α, β)-clusters.

Theorem 10 For any constant ε, α = 1, α − β = 1/2 − ε, there exist instances with $n^{\Omega(\log n)}$ (α, β)-clusters.

Proof: Consider the graph $G_{n,p}$ with $p = 1/2^l$. Consider all $\binom{n}{k}$ sets of size $k = \frac{2\log n}{l}(1 - \delta)$, where δ is a constant (determined later). For each such set S, the probability it is a clique is

\[ p^{\binom{k}{2}} \ge (1/2)^{l k^2/2} = (1/2)^{2\log^2 n\,(1-\delta)^2/l} = n^{-k(1-\delta)}. \]

We now want to show that conditioned on S being a clique, it is also an (α, β)-cluster with probability at least 1/2. This will imply that the expected number of (α, β)-clusters is at least

\[ 0.5 \binom{n}{k} n^{-k(1-\delta)} = n^{\Omega(\log n)}. \]

Fix such a set of size $k = \frac{2\log n}{l}(1 - \delta)$. The probability that a node outside is connected to more than a (1/2 + ε)-fraction of the set is upper bounded by

\[ 2^k \left( \frac{1}{2^l} \right)^{\frac{k}{2}(1+\varepsilon)} \le n^{\frac{2}{l}}\, 2^{-\frac{lk}{2}(1+\varepsilon)} = n^{\frac{2}{l}}\, n^{-(1+\varepsilon)(1-\delta)}. \]

By imposing $\frac{2}{l} - (1+\varepsilon)(1-\delta) < -1 - \log_n(2)$, we get that this probability is upper bounded by 1/(2n). So by union bound over all nodes we then get the desired result. We need to impose $(1+\varepsilon)(1-\delta) - \frac{2}{l} > 1 + \log_n(2)$. This is true for δ ≤ ε/4, l > 12/ε, and n large enough.

We note that for a certain range of parameters our bound in Theorem 13 improves over the general upper bound given in [16, 17]. Moreover, we show that even in graphs with only one (α, β)-cluster, finding this cluster is at least as hard as solving the planted clique problem for planted cliques of size O(log n), which is believed to be hard (see, e.g., Hazan and Krauthgamer [11]).

The Hidden Clique Problem: In this problem, the input is a graph on n vertices drawn at random from the following distribution $G_{n,1/2,k}$: pick a random graph from $G_{n,1/2}$ and plant in it a clique of size k = k(n). The goal is to recover the planted clique (in polynomial time), with probability at least (say) 1/2 over the input distribution. The clique is hidden in the sense that its location is adversarial and not known to the algorithm. The hidden clique problem becomes only easier as k gets larger, and the best polynomial-time algorithm to date [1] solves the problem whenever $k = \Omega(\sqrt{n})$. Finding a hidden clique for k = c log n for any c is believed to be hard. The decision version of this problem is also believed to be hard.

We begin with a simpler result that finding the approximately-largest (α, β)-cluster is at least as hard as the hidden clique problem.

Theorem 11 Suppose that for α = 1 and α − β = 1/4, there was an algorithm that for some constant c could find an (α, β)-cluster of size at least MAX/c, where MAX is the size of the largest community with those parameters. Then, that algorithm could be used to distinguish (1) a random graph $G_{n,1/2}$ from (2) a random graph $G_{n,1/2}$ in which a clique of size $2c \log_2(n)$ has been planted.


Proof: We can show that with probability at least 1 − 1/n the largest clique in $G_{n,1/2}$ has size 2 log(n), which implies the largest (α, β)-cluster (with α = 1 and α − β = 1/4) has size at most 2 log(n). On the other hand we can also show that with probability at least 1 − 1/n, for c ≥ 8 ln 2 the planted clique of size $2c \log_2(n)$ is a cluster with these parameters. Thus, under the assumption that distinguishing these two cases is hard, the problem of finding the approximately-largest (α, β)-cluster is hard.

We now show that in fact, even finding a single (α, β)-cluster is as hard as the hidden clique problem. Here, instead of $G_{n,1/2}$ we will use $G_{n,p}$ for constant p > 1/2. Note that the hidden clique problem remains hard in this setting as well.³

Theorem 12 For sufficiently small (constant) γ and ε, with probability at least 1 − 3/n, we have that: (1) the graph $G_{n,1-\gamma-\varepsilon}$ has no (1, 1 − γ) clusters; and (2) a hidden clique of size $\frac{1}{\varepsilon^2} \log n$ is a (1, 1 − γ) cluster. Therefore, finding even one such cluster is as hard as the hidden clique problem.

Proof: Consider $G_{n,p}$ for p = 1 − γ − ε. We start by showing that with probability at least 1 − 1/n the size of the largest clique is at most $\frac{-2\ln n}{\ln(1-\gamma-\varepsilon)}$. For any k, the probability that there exists a clique of size k is at most

\[ \binom{n}{k} p^{\binom{k}{2}} \le \frac{n^k}{k!}\, p^{k^2/2}\, p^{-k/2}. \]

For $k = \frac{-2\ln n}{\ln(1-\gamma-\varepsilon)} = -2\log_p n$, this is

\[ \frac{p^{-k/2}}{k!}\, n^{-2\log_p n}\, p^{2(\log_p n)^2} = \frac{p^{-k/2}}{k!} = \frac{n}{k!} = o\left( \frac{1}{n} \right). \]

This immediately implies that with probability at least 1 − 1/n, $G_{n,p}$ does not contain any (1, 1 − γ) clusters of size greater than $\frac{-2\ln n}{\ln(1-\gamma-\varepsilon)}$.

We now show that with probability at least 1 − 1/n, $G_{n,p}$ does not contain any (1, 1 − γ) clusters of size $\le \frac{-2\ln n}{\ln(1-\gamma-\varepsilon)}$. For this, we will show that for any set S of size $\le \frac{-2\ln n}{\ln(1-\gamma-\varepsilon)}$ and any node v not in S, the probability that v connects to at least (1 − γ)|S| nodes inside S is at least $1/\sqrt{n}$. Because these events are independent over the different nodes v, this implies that the probability that no node v outside S connects to at least (1 − γ)|S| nodes inside S is at most $\left(1 - \frac{1}{\sqrt{n}}\right)^{n-k} \le e^{-\sqrt{n}/2}$. By union bound over all sets S of size at most $\frac{-2\ln n}{\ln(1-\gamma-\varepsilon)}$, this will imply that the probability there exists a (1, 1 − γ) cluster of size at most $\frac{-2\ln n}{\ln(1-\gamma-\varepsilon)}$ is at most 1/n.

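Consider a set S of size k and a node v outside S. The probability that v connects to more than (1 − γ)k nodes inside S is at least

\[ \binom{k}{\gamma k} (1-\gamma-\varepsilon)^{(1-\gamma)k} (\gamma+\varepsilon)^{\gamma k} \ge \frac{1}{k} \left( \frac{(1-\gamma)ke}{\gamma k} \right)^{\gamma k} (1-\gamma-\varepsilon)^{(1-\gamma)k} (\gamma+\varepsilon)^{\gamma k}. \]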

This follows from the fact that

\[ \binom{k}{\gamma k} = \frac{k(k-1)\cdots(k-\gamma k+1)}{(\gamma k)!} \ge \frac{((1-\gamma)k)^{\gamma k}}{k(\gamma k/e)^{\gamma k}} = \frac{1}{k} \left( \frac{(1-\gamma)ke}{\gamma k} \right)^{\gamma k}, \]

where we use the fact that $(\gamma k)! < 2\sqrt{2\pi\gamma k}\,(\gamma k/e)^{\gamma k} < k(\gamma k/e)^{\gamma k}$.

³ In particular, if it were easy, then one could solve the decision version of the hidden clique problem for $G_{n,1/2}$ by first adding additional random edges and then solving the problem for $G_{n,p}$. We assume here that the planted clique has size greater than the largest clique that would be found in G(n, p).

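So, the probability that v connects to more than (1 − γ)k nodes inside S is at least

\[ \frac{1}{k} (1-\gamma-\varepsilon)^{k} \left[ \frac{1-\gamma}{\gamma} \cdot \frac{\gamma+\varepsilon}{1-\gamma-\varepsilon}\, e \right]^{\gamma k} \ge \left[ (1-\gamma-\varepsilon) e^{\gamma} \right]^{k} \frac{1}{k}. \]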

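This is decreasing with k and thus it suffices to consider $k = \frac{-2\ln n}{\ln(1-\gamma-\varepsilon)}$. For this k, we get that the probability that v connects to more than (1 − γ)k nodes inside S is at least

\[ \frac{1}{k}\, e^{-2\ln n}\, e^{-\frac{2\gamma \ln n}{\ln(1-\gamma-\varepsilon)}} = \frac{1}{k}\, n^{-2 - \frac{2\gamma}{\ln(1-\gamma-\varepsilon)}}. \]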

We want this to be greater than $1/\sqrt{n}$, and thus it suffices to have $-2 - \frac{2\gamma}{\ln(1-\gamma-\varepsilon)} > -0.4$. This holds for γ = 0.1, ε = 0.01.
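The numerical claim for γ = 0.1, ε = 0.01 can be checked directly. The short sketch below evaluates the exponent $-2 - 2\gamma/\ln(1-\gamma-\varepsilon)$ of n from the displayed lower bound; the function name `seed_exponent` is a hypothetical label introduced only for this check.

```python
import math

def seed_exponent(gamma, eps):
    """Exponent e such that the lower bound above equals (1/k) * n**e."""
    return -2 - 2 * gamma / math.log(1 - gamma - eps)

value = seed_exponent(0.1, 0.01)
# The bound (1/k) * n**value must exceed n**(-1/2); the text asks for value > -0.4,
# which leaves room for the 1/k factor since k = O(log n).
print(value > -0.4)
```

The slack between the exponent and −1/2 is what absorbs the 1/k factor for large n.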

Finally, it is easy to show that with probability at least 1 − 1/n, a hidden clique of size $k = \frac{1}{\varepsilon^2} \ln n$ is a (1, 1 − γ) cluster. This follows by noticing that every vertex outside the clique has in expectation k(1 − γ − ε) connections inside the clique, so by Hoeffding bounds, the probability it has more than k(1 − γ − ε) + εk = k(1 − γ) neighbors inside the clique is at most $1/n^2$. By union bound, we get that with probability at least 1 − 1/n every vertex outside the clique has at most k(1 − γ) neighbors inside the clique, so the planted clique is a community as desired.

References

[1] N. Alon, M. Krivelevich, and B. Sudakov. Finding a large hidden clique in a random graph. Random Structures Algorithms, 1998.

[2] R. Andersen, F. Chung, and K. Lang. Local graph partitioning using pagerank vectors. In 47th Annual Symposium on Foundations of Computer Science, 2006.

[3] A. Arora, T. Ge, S. Sachdeva, and G. Schoenebeck. Finding overlapping communities in social networks: Toward a rigorous approach. Manuscript, 2011.

[4] M.-F. Balcan, A. Blum, and S. Vempala. A discriminative framework for clustering via similarity functions. In Proceedings of the 40th ACM Symposium on Theory of Computing, 2008.

[5] L. M. Busse, P. Orbanz, and J. M. Buhmann. Cluster analysis of heterogeneous rank data. In Proceedings of the 24th Annual International Conference on Machine Learning, 2007.

[6] M. Charikar, S. Guha, E. Tardos, and D. B. Shmoys. A constant-factor approximation algorithm for the k-median problem. In Proceedings of the Thirty-First Annual ACM Symposium on Theory of Computing, 1999.

[7] W. Chen, Z. Liu, X. Sun, and Y. Wang. A game-theoretic framework to identify overlapping communities in social networks. Data Mining and Knowledge Discovery Journal, special issue on ECML PKDD, 2010.

[8] D. Cheng, R. Kannan, S. Vempala, and G. Wang. A divide-and-merge methodology for clustering. In Proceedings of the ACM Symposium on Principles of Database Systems, 2005.


[9] P. Doyle and J. Snell. Random walks and electric networks. Mathematical Assoc. of America, 1984.

[10] R. Duda, P. E. Hart, and D. G. Stork. Pattern classification. Wiley, 2001.

[11] E. Hazan and R. Krauthgamer. How hard is it to approximate the best Nash equilibrium? SIAM Journal on Computing, 2011.

[12] J. He, J. E. Hopcroft, H. Liang, S. Suwajanakorn, and L. Wang. Detecting the structure of social networks using (α, β)-communities. In 8th International Conference on Algorithms and Models for the Web-graph, 2011.

[13] K. Jain, M. Mahdian, and A. Saberi. A new greedy approach for facility location problems. In Proceedings of the 34th Annual ACM Symposium on Theory of Computing, 2002.

[14] J. Leskovec, K. Lang, A. Dasgupta, and M. Mahoney. Community structure in large networks: Natural cluster sizes and the absence of large well-defined clusters. Internet Mathematics, 2009.

[15] J. Leskovec, K. Lang, and M. Mahoney. Empirical comparison of algorithms for network community detection. In ACM WWW International Conference on World Wide Web, 2010.

[16] N. Mishra, R. Schreiber, I. Stanton, and R. Tarjan. Clustering social networks. In 5th International Conference on Algorithms and Models for the Web-graph, 2007.

[17] N. Mishra, R. Schreiber, I. Stanton, and R. Tarjan. Finding strongly-knit clusters in social networks. Internet Mathematics, 2009.

[18] G. Narasimhan and M. Smid. Geometric Spanning Networks. Cambridge University Press, 2007.

[19] M. E. J. Newman. Modularity and community structure in networks. Proceedings of the National Academy of Sciences, 2006.

[20] D. A. Spielman and S.-H. Teng. Nearly-linear time algorithms for graph partitioning, graph sparsification, and solving linear systems. In Proceedings of the Thirty-Sixth Annual ACM Symposium on Theory of Computing, 2004.

[21] D. A. Spielman and S.-H. Teng. A local clustering algorithm for massive graphs and its application to nearly-linear time graph partitioning. CoRR, abs/0809.3232, 2008. Available at http://arxiv.org/abs/0809.3232. Submitted to SICOMP.

[22] K. Voevodski, S. Teng, and Y. Xia. Finding local communities in protein networks. BMC Bioinformatics, 2009.

A Additional Proofs

A.1 Finding Self-determined Communities in Quasi-Polynomial Time

We present here a simple quasi-polynomial algorithm for enumerating all the self-determined communities.


Theorem 13 For any θ, α, β, γ = α − β, there are at most $n^{O(\log n/\gamma^2)}$ sets which are (θ, α, β) (weighted) self-determined communities. All such communities can be found by using Algorithm 6 with parameters θ, α, β, γ = α − β and $k(\gamma) = 2\log(4n)/\gamma^2$.

Proof: Fix a (θ, α, β) (weighted) self-determined community S. We show that there exists a multiset U of size $k(\gamma) = 2\log(4n)/\gamma^2$ such that the set $S_U$ of points in V that receive at least (α − γ/2)|U| amount of vote from points in U is identical to S. The proof follows simply by the probabilistic method. Let us fix a point i ∈ V. By Hoeffding, if we draw a set U of $2\log(4n)/\gamma^2$ points uniformly at random from S, then with probability at least 1 − 1/(2n), the average amount of vote that i receives from points in U is within γ/2 of the average amount of vote that i receives from points in S. By union bound, we get that with probability at least 1/2, for all points in V the average amount of vote that they receive from points in U is within γ/2 of the average amount of vote that they receive from points in S. Using this together with the definition of a self-determined community, we get that with probability at least 1/2 we obtain $S_U = S$ for U of size $2\log(4n)/\gamma^2$ drawn uniformly at random from S. This then implies that there must exist a multiset U of size k(γ) such that $S_U = S$.

Since in Algorithm 6 we exhaustively search over all multisets U (of points from V) of size k(γ), the list L we output clearly contains all the (θ, α, β) (weighted) self-determined communities. Moreover, clearly, $n^{O(\log n/\gamma^2)}$ is an upper bound on the number of (θ, α, β) (weighted) self-determined communities.

Algorithm 6 Algorithm for enumerating self-determined communities

Input: Affinity system (V, Π), parameters θ, α, β, γ; k(γ);

• Set L = ∅.

• Exhaustively search over all multisets U with elements from V of size k(γ).

  • For t = 1 to n (determining the meaning of “vote for”) do:

    • Let $S_U$ be the subset of points in V that receive at least (α − γ/2)|U| amount of vote from points in U. Add $S_U$ to the list L.

• Remove from the list L all the sets that are not (θ, α, β) weighted self-determined communities.

Output: List of self-determined communities L.
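Algorithm 6 can be run by brute force on very small instances. The sketch below handles the non-weighted case and rests on several assumptions that are not fixed by the text: `rankings[x]` is a list giving x's ranking (possibly including x itself), "vote for" with size parameter t means ranking within the top ⌈θt⌉, the community conditions are taken in the form "at least α|S| votes inside, fewer than β|S| votes outside", and the multiset size k is passed in directly (in practice far smaller than $2\log(4n)/\gamma^2$, so completeness is not guaranteed). All function names are hypothetical.

```python
from itertools import combinations_with_replacement
from math import ceil

def votes_for(i, voters, rankings, theta, t):
    """Votes i receives from voters (a multiset), each voting for its top ceil(theta*t)."""
    top = ceil(theta * t)
    return sum(1 for s in voters if i in rankings[s][:top])

def is_community(S, rankings, theta, alpha, beta):
    t = len(S)
    if t == 0:
        return False
    inside = all(votes_for(i, S, rankings, theta, t) >= alpha * t for i in S)
    outside = all(votes_for(j, S, rankings, theta, t) < beta * t
                  for j in rankings if j not in S)
    return inside and outside

def enumerate_communities(rankings, theta, alpha, beta, k):
    gamma = alpha - beta
    V = sorted(rankings)
    candidates = set()
    for U in combinations_with_replacement(V, k):   # all multisets of size k
        for t in range(1, len(V) + 1):              # t fixes the meaning of "vote for"
            S_U = frozenset(
                i for i in V
                if votes_for(i, U, rankings, theta, t) >= (alpha - gamma / 2) * k
            )
            candidates.add(S_U)
    # Final filtering step: keep only genuine communities.
    return {S for S in candidates if is_community(S, rankings, theta, alpha, beta)}
```

The number of multisets grows as $n^{k}$, which is where the quasi-polynomial bound of Theorem 13 comes from when $k = 2\log(4n)/\gamma^2$.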

A.2 Additional Proofs in Section 3

THEOREM 2 For any constant θ ≥ 1, for any $\alpha \ge 2\sqrt{\theta}/n^{1/4}$, there exists an instance such that the number of (θ, α, β)-self-determined communities with α − β = γ = α/2 is $n^{\Omega(1/\alpha)}$.

Proof: Consider $L = \sqrt{n}$ blobs $B_1, \ldots, B_L$, each of size $\sqrt{n}$. Assume that each point ranks the points inside its blob first (in an arbitrary order) and then ranks the points outside its blob randomly. We claim that with non-zero probability, for $l \le n^{1/4}/(2\sqrt{\theta})$, any union of l blobs satisfies the (θ, α, β)-self-stability property with parameters α = 1/l and γ = α/2.

Let us fix a set S which is a union of l blobs. Note that for each point i in S, the expected number of points in S voting for i is

\[ \sqrt{n} + (l\sqrt{n} - \sqrt{n})\, \frac{\theta l \sqrt{n} - \sqrt{n}}{n - \sqrt{n}}. \]

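Also, for a point j not in S the expected number of points in S voting for j is

\[ l\sqrt{n}\, \frac{\theta l \sqrt{n} - \sqrt{n}}{n - \sqrt{n}} \le l\sqrt{n}\, \frac{\theta l \sqrt{n}}{n} \le \sqrt{n}/4, \]

for $l \le n^{1/4}/(2\sqrt{\theta})$. By Chernoff, we have that the probability that j is voted for by more than $\sqrt{n}/2$ points is at most $e^{-\sqrt{n}/48}$.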

By union bound, we get that the probability that there exists a set S which is a union of l blobs that does not satisfy the (θ, α, β)-self-stability property with α = 1/l, γ = α/2 is at most

\[ n \cdot n^{l/2} \cdot e^{-\sqrt{n}/48} < 1, \]

for $l \le n^{1/4}/(2\sqrt{\theta})$.

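COROLLARY 1 The number of (θ, α, β)-self-determined communities in an affinity system (V, Π) satisfies

\[ B(n) = n^{O(\log(1/\gamma)/\alpha)} \left( \frac{\theta \log(1/\gamma)}{\alpha} \right)^{O\left( \frac{1}{\gamma^2} \log\left( \frac{\theta \log(1/\gamma)}{\alpha\gamma} \right) \right)} \]

and with probability ≥ 1 − 1/n we can find all of them in time B(n) poly(n).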

Proof: Consider a community size t. For any (θ, α, β)-self-determined community S, let $p_S$ be the probability that S is in the list output by the algorithm in Theorem 1 with parameters θ, α, β, t. By Theorem 1 we have that $p_S \ge 1 - \delta$. By linearity of expectation we have that $\sum_S p_S$ is the expected number of (θ, α, β)-self-determined communities in the list output by our algorithm. Combining these, we obtain that

\[ B(n)(1 - \delta) \le \sum_S p_S \le N_1(\delta) N_2(\delta), \]

where $k_1 = \log(16/\gamma)/\alpha$, $k_2(\delta) = \frac{8}{\gamma^2} \log\left( \frac{32\theta k_1}{\gamma\delta} \right)$, $N_1(\delta) = n^{k_1}$ and $N_2(\delta) = O((2\theta k_1)^{k_2(\delta)} \log(1/\delta))$. By setting δ = 1/2, we get the desired bound,

\[ B(n) = n^{O(\log(1/\gamma)/\alpha)} \left( \frac{\theta \log(1/\gamma)}{\alpha} \right)^{O\left( \frac{1}{\gamma^2} \log\left( \frac{\theta \log(1/\gamma)}{\alpha\gamma} \right) \right)}. \]

Let $N = N_1(1/2) N_2(1/2) n$. By running the algorithm in Theorem 1 $2\log N$ times we have that for each (θ, α, β)-self-determined community S, the probability that S is not output in any of the runs is at most $(1/2)^{2\log N} \le 1/N^2$. By union bound, with probability at least 1 − 1/n, we output all of them.

A.3 Self-determined Communities in Multi-faceted Affinity Systems

Recall that a multi-faceted affinity system is a system where each node may have more than one ranking of the other nodes. This may reflect, for example, that a person may have two rankings of other people, one corresponding to personal friends (in descending order of affinity), and one of co-workers. Suppose that each element i is allowed to have at most f different rankings $\pi_i^1, \ldots, \pi_i^f$. We say that the pair (S, ψ) is a multi-faceted community, where ψ : S → {1, . . . , f}, if S is a community where ψ(i) specifies which ranking facet should be used by element i. In other words, as before, let

\[ \phi^{\theta}_{S,\psi}(i) := \left| \{ s \in S : i \in \pi_s^{\psi(s)}(1 : \lceil \theta |S| \rceil) \} \right|. \]

Then (S, ψ) is an (α, β, θ)-multifaceted community if for all i ∈ S, $\phi^{\theta}_{S,\psi}(i) \ge \alpha|S|$, and for all j ∉ S, $\phi^{\theta}_{S,\psi}(j) < \beta|S|$.


For a bounded f, it is not harder to find multifaceted communities than to find regular communities. Note that all our sampling algorithms can be adapted as follows. Once a representative sample $i_1, \ldots, i_k$ of the community S is obtained, we can guess the facets $\psi(i_1), \ldots, \psi(i_k)$ while adding a multiplicative $f^k$ factor to the running time. We can thus get the set $S_2$ approximating S in the same way as it is found in Algorithms 2 and 3 while adding a multiplicative factor of $f^{k_1+k_2}$ to the running time. We thus obtain a list L that for each multi-faceted community (S, ψ) contains a set $S_2$ such that $\Delta(S_2, S) < \gamma t/8$:

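Claim 1 We can output a list L of

\[ (f \cdot n)^{O(\log(1/\gamma)/\alpha)} \left( \frac{f \cdot \theta \log(1/\gamma)}{\alpha} \right)^{O\left( \frac{1}{\gamma^2} \log\left( \frac{\theta \log(1/\gamma)}{\alpha\gamma} \right) \right)} \]

sets, such that for each multi-faceted community S there is an $S_2 \in L$ such that $\Delta(S_2, S) < \gamma t/8$.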

It remains to show that:

Lemma 4 Suppose that (S, ψ) is a valid (α, β, θ)-multifaceted community of size t. Given t and a set $S_2$ such that $\Delta(S_2, S) < \gamma t/8$, there is an algorithm that outputs S with probability $> f^{-8\log n/\gamma^2}/2$.

Moreover, a facet structure ψ′ can be recovered on S so that (S, ψ′) is an (α − γ/4, β + γ/4, θ)-multifaceted community.

Together with Claim 1, Lemma 4 shows that multifaceted communities can indeed be recovered in polynomial time.

THEOREM 5 Let S be an f-faceted (α, β, θ)-community. Then there is an algorithm that runs in $O(n^2)$ time and outputs S, as well as a facet structure ψ′ on S such that (S, ψ′) is an (α − γ/4, β + γ/4, θ)-multifaceted community with probability at least

\[ (f \cdot n)^{-O(\log(1/\gamma)/\alpha)} \left( \frac{f \cdot \theta \log(1/\gamma)}{\alpha} \right)^{-O\left( \frac{1}{\gamma^2} \log\left( \frac{\theta \log(1/\gamma)}{\alpha\gamma} \right) \right)} f^{-O(\log n/\gamma^2)}. \]

Proof (of Lemma 4): The algorithm is very simple. Guess a set $U_2$ of $m = 8\log n/\gamma^2$ points in $S_2$; guess a function $\psi_2$ on $U_2$; output S = the set of points that receive at least (α − γ/2)t votes according to $(U_2, \psi_2)$.

Note that in the non-faceted case, by Hoeffding’s inequality, with probability> 1/2 selecting a setU2 asabove and then selecting those points that receive at least(α − γ/2)t votes fromU2 would have yieldedS.This is because each element ofS receives at least(α−γ/8)t votes from elements ofS2, while each elementof the complementSc receives at most(β + γ/8)t votes from elements ofS2. This reasoning extends tothe multifaceted setting,provided, the functionψ2 coincides with the functionψ on the elements ofU2 ∩ S.This indeed happens with probability≥ f−|U2| = f−8 logn/γ2 , completing the proof of the first part of thelemma.
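The guess-and-vote step can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the paper's implementation: `prefs[s][k]` is a hypothetical ranking of all elements (most preferred first) held by element `s` under facet `k`, and the vote threshold is applied to the fraction of sampled voters, which is our reading of the $(\alpha - \gamma/2)t$ threshold after rescaling the sample of size $m$ to the community size $t$. The facet guess is drawn at random here; enumerating all guesses would serve the same purpose.

```python
import random
from math import ceil, log

def vote_fraction(prefs, U, psi, x, top):
    """Fraction of sampled voters u in U whose guessed facet psi[u]
    ranks x among its first `top` elements."""
    return sum(1 for u in U if x in prefs[u][psi[u]][:top]) / len(U)

def recover_from_sample(prefs, U2, psi2, t, alpha, gamma, theta):
    """Keep every element whose vote fraction reaches alpha - gamma/2
    (assumption: votes from the sample are rescaled to the size t)."""
    top = ceil(theta * t)
    return {x for x in prefs
            if vote_fraction(prefs, U2, psi2, x, top) >= alpha - gamma / 2}

def guess_and_vote(prefs, S2, t, f, alpha, gamma, theta, rng):
    """Sample m = 8 log n / gamma^2 voters from S2 (capped at |S2|),
    guess a facet for each voter uniformly at random, then vote."""
    n = len(prefs)
    m = min(len(S2), max(1, ceil(8 * log(n) / gamma ** 2)))
    U2 = rng.sample(sorted(S2), m)
    psi2 = {u: rng.randrange(f) for u in U2}
    return recover_from_sample(prefs, U2, psi2, t, alpha, gamma, theta)

# Toy affinity system (assumed data): 4 elements, 2 facets each.
prefs = {
    0: [[0, 1, 2, 3], [2, 3, 0, 1]],
    1: [[1, 0, 3, 2], [3, 2, 1, 0]],
    2: [[0, 1, 2, 3], [2, 3, 0, 1]],
    3: [[3, 2, 1, 0], [1, 0, 3, 2]],
}

# A correct facet guess recovers {0, 1}; a wrong guess recovers nothing.
good = recover_from_sample(prefs, [0, 1], {0: 0, 1: 0}, 2, 1.0, 0.5, 1.0)
bad = recover_from_sample(prefs, [0, 1], {0: 1, 1: 0}, 2, 1.0, 0.5, 1.0)
attempt = guess_and_vote(prefs, {0, 1}, 2, 2, 1.0, 0.5, 1.0, random.Random(0))
```

A single call to `guess_and_vote` succeeds only when the guessed facets agree with the true ones on the sample, which mirrors the $f^{-|U_2|}$ factor in the argument above.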

For the second part of the lemma we assume that the set $S$ is known and we need to recover the facets $\psi'$ that make $S$ a community. Note that this step is necessary in order to verify that $S$ is indeed a multifaceted community. There are two cases to consider.

Case 1: $t \le 8\log n/\gamma^2$. In this case we can find $\psi$ by exhaustively checking all possibilities in time $O(f^{8\log n/\gamma^2})$, which matches the inverse of the probability of success of the first step.
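For small $t$ the exhaustive search of Case 1 amounts to trying every one of the $f^t$ facet assignments. The sketch below is a hedged Python illustration; the data layout (`prefs[s][k]` as the ranking of all elements by `s` under facet `k`) and the function names are assumptions made for the example.

```python
from itertools import product
from math import ceil

def is_community(prefs, S, psi, alpha, beta, theta):
    """Check the (alpha, beta, theta) conditions for the pair (S, psi)."""
    t = len(S)
    top = ceil(theta * t)

    def votes(x):
        return sum(1 for s in S if x in prefs[s][psi[s]][:top])

    return (all(votes(i) >= alpha * t for i in S) and
            all(votes(j) < beta * t for j in prefs if j not in S))

def find_facets_exhaustively(prefs, S, f, alpha, beta, theta):
    """Try all f^|S| facet assignments; return the first valid one, or None."""
    members = sorted(S)
    for choice in product(range(f), repeat=len(members)):
        psi = dict(zip(members, choice))
        if is_community(prefs, S, psi, alpha, beta, theta):
            return psi
    return None

# Toy affinity system (assumed data): 4 elements, 2 facets each.
prefs = {
    0: [[2, 3, 0, 1], [0, 1, 2, 3]],
    1: [[1, 0, 3, 2], [3, 2, 1, 0]],
    2: [[0, 1, 2, 3], [2, 3, 0, 1]],
    3: [[3, 2, 1, 0], [1, 0, 3, 2]],
}

psi_01 = find_facets_exhaustively(prefs, {0, 1}, 2, 1.0, 0.5, 1.0)
psi_02 = find_facets_exhaustively(prefs, {0, 2}, 2, 1.0, 0.5, 1.0)
```

Here `{0, 1}` is certified with element 0 using its second facet, while `{0, 2}` admits no valid assignment for these parameters.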

Case 2: $t > 8\log n/\gamma^2$. In this case we use linear programming to find a fractional version $\psi_f$ of the function $\psi$ first. In other words, we find a function $\psi_f : S \times \{1, \ldots, f\} \to [0, 1]$ such that $(S,\psi_f)$ is a "community" on average:

1. for all $s \in S$, $\sum_{i=1}^{f} \psi_f(s, i) = 1$;

2. for all $x \in S$, $\sum_{s \in S} \sum_{i=1}^{f} \psi_f(s, i) \cdot \chi_{x \in \pi^i_s(1:\theta t)} \ge \alpha t$;

3. for all $y \notin S$, $\sum_{s \in S} \sum_{i=1}^{f} \psi_f(s, i) \cdot \chi_{y \in \pi^i_s(1:\theta t)} < \beta t$.

This linear program is feasible, since the original $\psi$ is an integral solution to it. As a result, we obtain a fractional solution $\psi_f$ satisfying the three conditions. To obtain $\psi'$ we round $\psi_f$ by sampling. In other words, we set $\psi'(s) = i$ with probability $\psi_f(s, i)$. By Hoeffding's inequality, since $t > 8\log n/\gamma^2$, the sampling will preserve conditions 2 and 3 that were imposed on $\psi_f$ up to an additive error of $\gamma/4$. Thus, by definition, $(S,\psi')$ will be an $(\alpha - \gamma/4, \beta + \gamma/4, \theta)$-multifaceted community.
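The rounding step can be sketched in a few lines. The Python below is only an illustration under assumptions: solving the linear program is not shown, and `psi_frac[s]` is a hypothetical list of the $f$ weights $\psi_f(s, \cdot)$ that is assumed to come from some LP solver. As in the other sketches, `prefs[s][k]` is an assumed ranking of all elements by `s` under facet `k`.

```python
import random
from math import ceil

def fractional_votes(prefs, S, psi_frac, x, theta):
    """Left-hand side of conditions 2 and 3: total weight that members
    of S place on facets ranking x among the top ceil(theta * |S|)."""
    top = ceil(theta * len(S))
    return sum(w
               for s in S
               for i, w in enumerate(psi_frac[s])
               if x in prefs[s][i][:top])

def round_facets(psi_frac, rng):
    """Independently set psi'(s) = i with probability psi_frac[s][i]."""
    return {s: rng.choices(range(len(w)), weights=w)[0]
            for s, w in psi_frac.items()}

# Toy affinity system (assumed data): 4 elements, 2 facets each.
prefs = {
    0: [[2, 3, 0, 1], [0, 1, 2, 3]],
    1: [[1, 0, 3, 2], [3, 2, 1, 0]],
    2: [[0, 1, 2, 3], [2, 3, 0, 1]],
    3: [[3, 2, 1, 0], [1, 0, 3, 2]],
}

# Element 0 splits its weight evenly, element 1 commits to facet 0.
half = {0: [0.5, 0.5], 1: [1.0, 0.0]}
weight_on_0 = fractional_votes(prefs, {0, 1}, half, 0, 1.0)

# An integral "fractional" solution is reproduced exactly by rounding.
integral = {0: [0.0, 1.0], 1: [1.0, 0.0]}
psi_rounded = round_facets(integral, random.Random(1))
```

Because each $\psi'(s)$ is drawn independently, the vote count of any fixed element is a sum of independent bounded terms, which is the setting in which Hoeffding's inequality applies.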

A.4 Extensions to weighted affinity systems and to the local model

We note that Algorithm 4 can be combined with our reduction from weighted to unweighted communities to obtain a local algorithm for finding communities in the weighted case.

Extending the local approach to the multifaceted setting is more involved, since the definition of $R(v)$ would need to be adapted to this setting. Indeed, the multifaceted version $R_f(v)$ of $R(v)$ can be taken to be a random element voted by a random facet $i$ of $v$. Then Algorithm 4 can be adapted by taking the threshold to be $\frac{\alpha - 1/2}{\theta^2 f^2 t}$, where $f$ is the number of facets. Note that while an approximation to any community $S$ can be found locally in near-linear time, finding the exact community $S$ as well as the facet structure on $S$ as in Lemma 4 will still take $f^{O(\log n/\gamma^2)}$ time.
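The multifaceted sampling primitive $R_f(v)$ described above can be sketched as follows. This is a hedged Python illustration, not the paper's definition in full: it assumes that "voted by a facet" means being among the first $\lceil \theta t \rceil$ entries of that facet's ranking, and that both the facet and the element are drawn uniformly. The layout `prefs[v][k]` and the function name are hypothetical.

```python
import random
from math import ceil

def random_voted_element(prefs, v, t, theta, rng):
    """Draw a uniformly random facet of v, then a uniformly random
    element among the top ceil(theta * t) of that facet's ranking."""
    facet = rng.randrange(len(prefs[v]))
    top = prefs[v][facet][:ceil(theta * t)]
    return rng.choice(top)

# Toy affinity system (assumed data): 4 elements, 2 facets each.
prefs = {
    0: [[0, 1, 2, 3], [2, 3, 0, 1]],
    1: [[1, 0, 3, 2], [3, 2, 1, 0]],
    2: [[0, 1, 2, 3], [2, 3, 0, 1]],
    3: [[3, 2, 1, 0], [1, 0, 3, 2]],
}

rng = random.Random(0)
draws = [random_voted_element(prefs, 3, 2, 0.5, rng) for _ in range(50)]
```

With $\theta t = 1$, element 3 can only return the top entry of one of its two facets, so every draw is either 3 or 1.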
