Modeling and analysis of higher-order interactions in
collaborative networks.
Alexey Nikolaev.
CUNY Graduate Center, Computer Science Program.
1
Goals
• How to model high-order interactions?
• Which systems possess them?
• How to deal with this higher-order data, and
what results can be obtained?
2
How to represent a (collaborative) network?
• Trees
• Graphs
• Simplicial complexes
• Hypergraphs
• Bipartite graphs
3
Graphs. Scale-free networks. $P(\deg) \sim \deg^{-\gamma}$
WWW network of websites connected by hyperlinks, Internet at the router level, protein-protein interactions, and many others belong to the class of scale-free networks.
4
5
Collaborative co-authorship networks
In the simplest case a co-authorship network can be modeled as a graph, where all participating persons are nodes with edges between each pair of authors who have written at least one paper together.
Then the famous Erdős number denotes the distance in this graph from any scientist to Paul Erdős.
This view of the network is used in the widely-cited article by [newman2001structure].
6
Collaborative co-authorship networks
[newman2001structure] considered four scientific publication databases (MEDLINE, Los Alamos e-Print Archive, SPIRES, and NCSTRL).
The degree distributions were close to a power law but not exactly, and were better fitted by a power law with exponential cutoff
$P(d) \sim d^{-\tau} e^{-d/d_e}$
7
Collaborative co-authorship networks
The average “degree of separation” was found to be close to 6, and the diameter of the networks was no more than about 20.
The author also explored the percolation transition and the formation of a giant component, as well as the clustering of the network.
Def. 1. The clustering coefficient C in the context of the co-authorship graph is the average (over all authors) fraction of pairs of a person’s collaborators who have also collaborated with one another.
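Def. 1 can be sketched in a few lines of Python. This is only an illustration on a made-up toy graph: the author names are hypothetical, and skipping authors with fewer than two collaborators is an assumption (some conventions count them as 0 instead).

```python
from itertools import combinations

def clustering_coefficient(adj):
    """Average, over authors, of the fraction of collaborator pairs that
    have also collaborated with one another (Def. 1).
    adj maps each author to the set of their co-authors."""
    fractions = []
    for author, nbrs in adj.items():
        pairs = list(combinations(sorted(nbrs), 2))
        if not pairs:
            continue  # assumption: authors with < 2 collaborators are skipped
        linked = sum(1 for u, v in pairs if v in adj[u])
        fractions.append(linked / len(pairs))
    return sum(fractions) / len(fractions) if fractions else 0.0

# Hypothetical toy co-authorship graph.
coauthors = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"A"},
}
print(clustering_coefficient(coauthors))  # (1/3 + 1 + 1) / 3 = 7/9
```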
8
The role of triangles in ArXiv co-authorship
[lambiotte2005n] considered all ArXiv papers from the interval [1995, 2005] that contain the word “network” in their abstract and are classified as “cond-mat” papers. The reason for choosing this dataset was that it has few papers with a very large number of authors; papers usually have 1-4 co-authors.
In this dataset, among all the triangles, there were 5550 triangles formed by the co-authors of the same paper, and only 30 formed only by pairwise interactions of three authors. Thus in such real co-authorship networks a higher-order interaction of three authors is more likely to occur than its pairwise analog. (It is probably even more likely in scientific communities with a large number of co-authors.)
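The distinction between the two kinds of triangles can be illustrated with a short sketch. It is not the procedure of [lambiotte2005n]; the function name and the toy paper list are hypothetical, and the brute-force enumeration is only meant for small inputs.

```python
from itertools import combinations

def classify_triangles(papers):
    """papers: list of author lists.
    Returns (same_paper, pairwise_only): triangles of the co-authorship
    graph whose three authors did / did not co-sign a single paper."""
    edges, joint = set(), set()
    for authors in papers:
        a = sorted(set(authors))
        edges.update(combinations(a, 2))   # pairwise links
        joint.update(combinations(a, 3))   # triples from one paper
    nodes = sorted({x for e in edges for x in e})
    triangles = {t for t in combinations(nodes, 3)
                 if all(p in edges for p in combinations(t, 2))}
    return triangles & joint, triangles - joint

# Hypothetical data: one three-author paper and three two-author papers.
papers = [["A", "B", "C"], ["C", "D"], ["D", "E"], ["C", "E"]]
same_paper, pairwise_only = classify_triangles(papers)
print(same_paper)     # {('A', 'B', 'C')}
print(pairwise_only)  # {('C', 'D', 'E')}
```

In the graph model both triangles look identical; only the bipartite, hypergraph, or simplicial view keeps them apart.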
9
Higher-order collaborations in co-authorship
Alternatives to the graph model. Bipartite graph where the nodes are divided into (Authors, Papers). This is the most complete and detailed view of the network, but at the same time it may not reveal the really interesting properties of the network until it is aggregated to a simpler form.
Simplicial complex is another possibility, where a face belongs to the complex if all its authors wrote a paper together.
The simplicial complex model has been proposed for describing collaborating teams, specifically when multi-way relations between the participants must be captured. For example, see: [ramanathan2011beyond], [johnson2009hypernetworks], [andjelkovic2015hierarchical], [maletic2014consensus].
10
Simplicial complex. Definition
An abstract simplicial complex is a mathematical object similar to a hypergraph, satisfying an additional “hereditary” property: if e is a hyperedge, then all subsets of e are hyperedges too.
Def. 2. An abstract simplicial complex (V, S), or simplicial complex for short, is a collection of nodes V together with a non-empty collection of their subsets $S \subseteq 2^V$ that is closed under the subset operation. In other words, if $s \in S$, then $\forall t \subseteq s : t \in S$. The set of nodes V is also called the ground set.
11
Simplicial complex. Definition
Def. 3. Each $s \in S$ is called a face (or a simplex) of the complex (V, S).
Def. 4. The dimension of a face s is defined as $\dim(s) = |s| - 1$, and the dimension of the entire complex is $\dim(V, S) = \max\{\dim(s) \mid s \in S\}$.
Def. 5. The maximal faces of a complex are called facets. In other words, a facet is a face that is not a proper subset of any other face of the complex.
Def. 6. A simplicial complex $(V', S')$ is called a subcomplex of (V, S) if $V' \subseteq V$ and $S' \subseteq S$.
Def. 7. A D-skeleton of a complex is a subcomplex containing all the faces of the original complex whose dimension is less than or equal to D.
For example, the 1-skeleton of a complex is its underlying graph.
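Defs. 2-7 translate almost directly into set operations. The sketch below is a minimal illustration, assuming faces are stored as Python frozensets and the empty face is left out; all function names are made up for this example.

```python
from itertools import combinations

def closure(facets):
    """All non-empty faces generated by the given facets (Def. 2)."""
    faces = set()
    for facet in facets:
        vs = sorted(facet)
        for k in range(1, len(vs) + 1):
            faces.update(frozenset(c) for c in combinations(vs, k))
    return faces

def dimension(faces):
    """dim of the complex = max over faces of |s| - 1 (Def. 4)."""
    return max(len(s) for s in faces) - 1

def skeleton(faces, d):
    """Faces of dimension <= d (Def. 7)."""
    return {s for s in faces if len(s) - 1 <= d}

def facets_of(faces):
    """Faces that are not a proper subset of another face (Def. 5)."""
    return {s for s in faces if not any(s < t for t in faces)}

# A filled triangle {a, b, c} with an extra edge {c, d} attached.
complex_faces = closure([{"a", "b", "c"}, {"c", "d"}])
print(len(complex_faces))               # 9 faces
print(dimension(complex_faces))         # 2
print(len(skeleton(complex_faces, 1)))  # 8: four vertices and four edges
```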
12
Topological “holes” as missed collaborations
[ramanathan2011beyond] considered a scientific co-authorship network as a simplicial complex. They looked for possibly “missed” collaborations: groups of scientists who were working in the same field and had common co-authors but never wrote a paper together.
The Betti numbers of the network are $B_0 = 1$, $B_1 = 2$, and $B_{\geq 3} = 0$: one connected component and two holes.
Many such missed collaborations are topological “holes” in the simplicial complex, which are counted by the Betti numbers of the complex.
13
Homology of simplicial complexes
We assume an orientation for each of the simplices of the complex: the simplex [a, b, c] = [b, c, a] = [c, a, b] has the opposite orientation to [a, c, b]. So [a, b, c] = −[a, c, b], and [a, b] = −[b, a]. Generally speaking, swapping two vertices changes the orientation (the sign) of the simplex.
Let $C_k$ be the vector space spanned by all simplices of dimension k of the given complex X. $C_k$ is an abelian group, and simply speaking, it contains all possible “combinations” of the k-simplices.
$C_{\dim(X)} \to \cdots \to C_3 \to C_2 \to C_1 \to C_0$
14
Homology of simplicial complexes
We can define the boundary map $\partial_k : C_k \to C_{k-1}$ that maps any such “combination” of simplices to their “boundary”. It is enough to define how the map operates on the basis elements of $C_k$:
$\partial_k [v_0, \dots, v_k] = \sum_{i=0}^{k} (-1)^i [v_0, \dots, v_{i-1}, v_{i+1}, \dots, v_k]$
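As a worked example, the boundary formula can be applied to a filled triangle [a, b, c] and then once more to the result. The second application gives zero, which illustrates why the image of $\partial_{k+1}$ always lies inside the kernel of $\partial_k$.

```latex
\begin{align*}
\partial_2 [a, b, c] &= [b, c] - [a, c] + [a, b], \\
\partial_1 [v_0, v_1] &= [v_1] - [v_0], \\
\partial_1 \partial_2 [a, b, c]
  &= ([c] - [b]) - ([c] - [a]) + ([b] - [a]) = 0 .
\end{align*}
```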
15
Homology of simplicial complexes
The number of topological “holes” of dimension k, also called the k-th Betti number, is given by the formula:
$B_k = \dim(\ker(\partial_k)) - \dim(\operatorname{im}(\partial_{k+1}))$.
The kernel and image of a linear map $f : A \to B$, defined as $\ker(f) = \{x \in A \mid f(x) = 0\}$ and $\operatorname{im}(f) = \{y \in B \mid y = f(x) \text{ for some } x \in A\}$, are subspaces of A and B, respectively.
Informally:
• $B_0$ is the number of connected components
• $B_1$ is the number of “circular” holes
• $B_2$ is the number of “voids” or “cavities”, etc.
The torus $T^2$ (not to be confused with the solid torus) has $B_0 = 1$, $B_1 = 2$, and $B_2 = 1$.
16
Homology of simplicial complexes
There is the rank-nullity theorem relating the dimensions of the kernel and the image of a linear map $f : A \to B$:
$\dim(\ker(f)) + \dim(\operatorname{im}(f)) = \dim(A)$.
Also, if the map f is represented by a matrix, then dim(im(f)) and dim(ker(f)) are the rank and the nullity of that matrix, respectively.
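Putting the boundary map and the rank-nullity theorem together gives a direct (if inefficient) way to compute Betti numbers: build each boundary matrix, take its rank, and use nullity = number of columns minus rank. The sketch below is not the persistence algorithm cited on this slide; it is a plain Gaussian elimination over the rationals, and all function names are made up for the illustration.

```python
from fractions import Fraction
from itertools import combinations

def faces_by_dim(facets):
    """Map dimension -> sorted list of faces (as sorted tuples)."""
    faces = {}
    for facet in facets:
        vs = sorted(facet)
        for k in range(1, len(vs) + 1):
            faces.setdefault(k - 1, set()).update(combinations(vs, k))
    return {d: sorted(s) for d, s in faces.items()}

def boundary_matrix(k_faces, km1_faces):
    """Matrix of the boundary map: one column per k-face, one row per (k-1)-face."""
    index = {f: i for i, f in enumerate(km1_faces)}
    rows = [[0] * len(k_faces) for _ in km1_faces]
    for col, face in enumerate(k_faces):
        for i in range(len(face)):
            sub = face[:i] + face[i + 1:]      # drop the i-th vertex
            rows[index[sub]][col] = (-1) ** i  # sign from the boundary formula
    return rows

def rank(matrix):
    """Rank by Gaussian elimination with exact rational arithmetic."""
    m = [[Fraction(x) for x in row] for row in matrix]
    r = 0
    for c in range(len(m[0]) if m else 0):
        pivot = next((i for i in range(r, len(m)) if m[i][c] != 0), None)
        if pivot is None:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(r + 1, len(m)):
            if m[i][c] != 0:
                factor = m[i][c] / m[r][c]
                m[i] = [a - factor * b for a, b in zip(m[i], m[r])]
        r += 1
    return r

def betti_numbers(facets):
    """B_k = nullity(boundary_k) - rank(boundary_{k+1}) for k = 0..dim."""
    faces = faces_by_dim(facets)
    top = max(faces)
    ranks = {0: 0, top + 1: 0}  # boundary_0 and boundary_{top+1} are zero maps
    for k in range(1, top + 1):
        ranks[k] = rank(boundary_matrix(faces[k], faces[k - 1]))
    return [len(faces[k]) - ranks[k] - ranks[k + 1] for k in range(top + 1)]

# Two hollow triangles glued at vertex c: one component, two holes.
two_loops = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d"), ("d", "e"), ("c", "e")]
print(betti_numbers(two_loops))          # [1, 2]
print(betti_numbers([("a", "b", "c")]))  # [1, 0, 0]: the filled triangle has no hole
```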
An efficient persistence algorithm for computing simplicial complexhomology is proposed by [zomorodian2005computing].
17
Q-analysis (Atkin 1974)
Def. 8 (q-nearness). Two simplexes $\sigma_1$ and $\sigma_2$ are q-near if their intersection is a q-dimensional face.
Def. 9 (q-connectivity). Two simplexes $\sigma_1$ and $\sigma_2$ are q-connected if there is a sequence of pairwise q-near simplexes between them.
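Defs. 8-9 suggest a simple union-find computation of q-connected components. The sketch below makes one assumption that should be stated loudly: two simplices are treated as q-near when they share at least q + 1 vertices (a common face of dimension q or higher), which is the usual reading in Q-analysis; replace `>=` with `==` for the literal reading of Def. 8. The function name is hypothetical.

```python
def q_components(simplices, q):
    """Group the simplices of dimension >= q into q-connected components."""
    items = [frozenset(s) for s in simplices if len(s) - 1 >= q]
    parent = list(range(len(items)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            # Assumption: sharing at least q + 1 vertices counts as q-near.
            if len(items[i] & items[j]) >= q + 1:
                parent[find(i)] = find(j)

    groups = {}
    for i, s in enumerate(items):
        groups.setdefault(find(i), []).append(set(s))
    return list(groups.values())

simplices = [{1, 2, 3}, {2, 3, 4}, {4, 5}, {6, 7, 8}]
for q in range(3):
    print(q, q_components(simplices, q))
```

For this toy input there are 2 components at q = 0 and 3 components at q = 1 and q = 2: raising q breaks the complex into smaller, more tightly connected pieces.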
18
Q-analysis (Atkin 1974)
Division of a complex into its q-connected components:
19
Simplicial complex model for opinion formation
Each agent has an opinion which is a set of judgments (and seen as a simplex of the simplicial complex). The judgments are the vertices of that complex.
All interacting agents are connected into a graph network.
There are two networks: a graph of agents, and a simplicial complex of their opinions ([maletic2014consensus]).
20
Simplicial complex model for opinion formation
The system evolves according to these rules:
• If two agents have the same opinion, every one of their common neighbors adopts the same opinion.
• If two agents have different opinions, either one of them replacesone of their original judgments with a new one from the otheragent, or they form one common union opinion (the choicebetween the two options is random, and the probability of formingthe union opinion is proportional to the “overlap” of theiropinions).
• A new judgment can be added to a simplex that has the maximum“significance”.
21
The number of opinions in the network for (left) modular scale-freenetwork, and (right) random graph network.
The number of judgments at the end vs. at the beginning of the simulation
22
Academic teams as evolving hypergraphs
[taramasco2010academic] modeled scientific collaboration as an evolving hypergraph of people and concepts.
Several hypotheses on novelty introduction were tested. In particular, it was found that novel concepts are equally likely to be introduced by new and by repeated collaborations. However, they are more likely to be introduced either by teams of newcomers or by well-established teams of experts.
23
Broadcasting in multi-hop wireless networks
As shown by [ramanathan2011beyond], broadcasting in a multi-hop network can be modeled as a simplicial complex. It is a natural representation, because any set of recipients that receive the same packet via one transmission defines a face of the simplicial complex. This type of simplicial complex is called a neighborhood simplicial complex.
Def. 10. The neighborhood complex of a graph is the set of simplices such that all vertices in a given simplex have a common neighbor. (First introduced by [lovasz1978kneser].)
Then a generalized cost-aware network-wide broadcast problem can beformulated as a variant of the minimum connected dominating setproblem for the neighborhood complex of the graph of the network.
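By Def. 10, every face of the neighborhood complex is a subset of the neighbor set N(v) of some vertex v, so the facets are simply the maximal neighbor sets. A minimal sketch, assuming the network is given as a hypothetical adjacency dictionary:

```python
def neighborhood_facets(adj):
    """Facets of the neighborhood complex: the maximal neighbor sets N(v).
    adj maps each node to the set of nodes that hear its transmission."""
    hoods = {frozenset(nbrs) for nbrs in adj.values() if nbrs}
    return {h for h in hoods if not any(h < other for other in hoods)}

# Hypothetical four-node path network a - b - c - d.
network = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
print(neighborhood_facets(network))  # the facets {a, c} and {b, d}
```

Here the facet {a, c} is the set of nodes reached by one transmission of b, and {b, d} is the set reached by one transmission of c.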
24
Čech and Vietoris-Rips geometric complexes
Two widely studied geometric simplicial complexes, Čech and Vietoris-Rips, are defined as follows.
Def. 11. Let $X = \{x_1, x_2, \dots, x_n\}$ be a collection of points in $\mathbb{R}^d$, and let r > 0. The Čech complex consists of all simplices $\{x_{i_1}, \dots, x_{i_k}\}$ whose points are completely within some d-dimensional closed ball of radius r.
Def. 12. In contrast, the Vietoris-Rips complex consists of all simplices $\{x_{i_1}, \dots, x_{i_k}\}$ such that all their points are pairwise at distance no more than r, that is, $\|x_i - x_j\| \leq r$.
25
Čech and Vietoris-Rips geometric complexes
It is the case that if the original set of points was sampled uniformly at random, then its Vietoris-Rips complex is a clique complex (also called flag complex) of a random geometric graph (RGG) of radius r.
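The clique-complex view gives an easy way to build a Vietoris-Rips complex: connect points at distance at most r, then add every clique as a simplex. The sketch below is a brute-force illustration for small point sets only (it enumerates all vertex subsets up to the chosen size); the function name and the `max_dim` cutoff are assumptions made for the example.

```python
from itertools import combinations
from math import dist

def vietoris_rips(points, r, max_dim=2):
    """Simplices (as index tuples) of the Vietoris-Rips complex up to max_dim."""
    n = len(points)
    close = {(i, j) for i, j in combinations(range(n), 2)
             if dist(points[i], points[j]) <= r}
    simplices = [(i,) for i in range(n)]
    for k in range(2, max_dim + 2):          # k = number of vertices
        for cand in combinations(range(n), k):
            # A simplex is present iff all its vertex pairs are within r.
            if all(pair in close for pair in combinations(cand, 2)):
                simplices.append(cand)
    return simplices

points = [(0, 0), (1, 0), (0, 1), (5, 5)]
print(vietoris_rips(points, 1.5))  # 4 vertices, 3 edges, and the triangle (0, 1, 2)
print(vietoris_rips(points, 1.2))  # the edge (1, 2) and the triangle disappear
```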
Example usage in topological data analysis
26
Protein-protein interaction (PPI) network
[Huang, Hsuan-Ting, et al. "A network of epigenetic regulators guides developmental haematopoiesis in vivo." Nature cell biology 15.12 (2013): 1516-1525.]
27
Proteins. Protein-complex purification
The process of PCP reducing the protein complexes to a graph, as shown in [gagneur2004modular].
(a) If two proteins are found in a complex they form an edge,
(b) No distinction is made between three pairs and a triplet (the graph model is used instead of a more accurate hypergraph or simplicial complex)
28
Protein complexes. Modular decomposition
After constructing the graph, the authors perform its modular decomposition, which is a technique of transforming the original graph into a tree of nested modules. It is performed by sequentially replacing modules with a single node, while making a distinction between three types of node composition, so-called “series”, “parallel”, and “prime” (which is an umbrella term for anything that does not fit into the two previous types). A module is a “series” composition of nodes if their induced subgraph is a clique (all are neighbors), and it is a “parallel” composition if the induced subgraph is an empty graph (no two of them are neighbors).
Def. 13. A module is a set of vertices of a graph, all of which share the same neighbourhood outside the set.
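Def. 13 and the series/parallel distinction can be checked directly on an adjacency dictionary. This sketch only tests a given vertex set; it is not a modular decomposition algorithm and not the authors' implementation. The classification follows the simplified description on this slide and assumes the set is already a module with at least two vertices. The protein names are hypothetical.

```python
def is_module(adj, subset):
    """True if all vertices of `subset` have the same neighbors outside it."""
    subset = set(subset)
    outside = [adj[v] - subset for v in subset]
    return all(o == outside[0] for o in outside)

def composition_type(adj, subset):
    """Classify a module by its induced subgraph:
    clique -> 'series', empty graph -> 'parallel', anything else -> 'prime'."""
    subset = set(subset)
    inner_degrees = [len(adj[v] & subset) for v in subset]
    if all(d == len(subset) - 1 for d in inner_degrees):
        return "series"
    if all(d == 0 for d in inner_degrees):
        return "parallel"
    return "prime"

# Hypothetical interaction graph: X and Y both bind H but not each other,
# P and Q bind H and each other.
ppi = {
    "H": {"X", "Y", "P", "Q"},
    "X": {"H"}, "Y": {"H"},
    "P": {"H", "Q"}, "Q": {"H", "P"},
}
print(is_module(ppi, {"X", "Y"}), composition_type(ppi, {"X", "Y"}))  # True parallel
print(is_module(ppi, {"P", "Q"}), composition_type(ppi, {"P", "Q"}))  # True series
print(is_module(ppi, {"X", "P"}))                                     # False
```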
29
30
Protein complexes. Modular decomposition
A group of proteins in “parallel” composition are often substitutes of each other in a bigger complex: they occur independently as mutually-exclusive alternatives. Proteins in “series” composition occur together, and this is an indication that they can be a stable substructure of a protein-protein interaction network, a group of proteins acting as a single unit. And the “prime” composition modules are usually complex structures that can be seen as irreducible backbones of the network.
One over-simplification of this approach is that the protein interactions are first reduced to a graph. There remains room for considering higher-order interactions (such as a hypergraph or a simplicial complex model). A second over-simplification is that the stoichiometric properties of complex formation (the number of copies of the same protein in the complex) are also neglected.
31
Co-occurrence of biological species
32
Co-occurrence of biological species
The observable data is a hypergraph (or a bipartite graph), where the nodes are the species, and each hyperedge corresponds to the set of species that live in the same site. Since the data is collected over multiple sites, we know which combinations of the species are compatible. If one wishes, one can also interpret this data as a simplicial complex; however, the hypergraph is probably a more accurate representation. Usually, this data is represented in the form of an incidence matrix that is called the co-occurrence (presence-absence) matrix.
In addition to the co-occurrence data, there is also the abundancematrix that tells how many individual organisms of each species weredetected in each site.
33
Co-occurrence. Network inference
There are several possible relationships between species. Pairwise relations can be symbolically represented by a 3-by-3 matrix, reflecting the effect of the relation on the pair of species.
++  +0  +−
0+  00  0−
−+  −0  −−
Mutualism (++) is a positive effect on both species. A negative effect on both is called Competition (−−). The others are: Predation (+−), Commensalism (+0), Amensalism (−0), and Neutral (00).
One can score the relation by quantifying the species similarity: the similarity of their rows in the co-occurrence matrix means that they are likely to have a positive effect on each other (++ or +0).
34
Co-occurrence. Network inference
Alternatively, one can do regression analysis for identifying more complex and not necessarily pairwise relations. Unfortunately, non-pairwise relationships are combinatorially hard to test: there are too many possibilities.
35
Co-occurrence. Community assembly rules
The most influential work in this field of Ecology and Biology is the work done by Jared Diamond (see [cody1975ecology]), where he studied the distribution of bird species over the tropical islands of the Bismarck Archipelago, and proposed the community assembly rules that drive the species coexistence. Although controversial and causing much debate, they were very influential and remain so to this day.
One of the rules says that the species that are adapted to the same niche do not coexist (they are forbidden pairs). This competitive exclusion relation between the species can be deduced from the “checkerboard”-like $\begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$ patterns in the incidence matrix. The second rule says that co-occurring species specialize and don’t occupy exactly the same niche (and so, don’t compete).
36
Co-occurrence. Community assembly rules
According to Diamond, real communities should contain fewer species combinations and have more checkerboard pairs than randomly assembled communities.
Properties such as the number of checkerboard pairs (or more complex co-occurrence indices, such as the Checkerboard score C) can be tested statistically, showing that the real communities are unlikely to occur by random chance. [gotelli2002species] confirmed the basic prediction of Diamond’s model (probably not giving a definitive answer yet).
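Both indices are easy to compute from a presence-absence matrix. The sketch below assumes rows are species and columns are sites, with 0/1 entries, and that every species occurs in at least one site. For the C-score it uses the usual formula of Stone and Roberts, averaging $(r_i - S_{ij})(r_j - S_{ij})$ over species pairs, where $r_i$ is the number of sites occupied by species i and $S_{ij}$ the number of shared sites; this formula and the function names are not taken from the slides. The statistical test against randomized matrices is not shown.

```python
from itertools import combinations

def checkerboard_units(row_a, row_b):
    """Number of 2x2 checkerboard submatrices formed by two species rows."""
    shared = sum(1 for a, b in zip(row_a, row_b) if a and b)
    return (sum(row_a) - shared) * (sum(row_b) - shared)

def c_score(matrix):
    """Average number of checkerboard units over all species pairs."""
    pairs = list(combinations(matrix, 2))
    return sum(checkerboard_units(a, b) for a, b in pairs) / len(pairs)

def checkerboard_pairs(matrix):
    """Number of species pairs that never occur in the same site."""
    return sum(1 for a, b in combinations(matrix, 2)
               if not any(x and y for x, y in zip(a, b)))

# Hypothetical data: 3 species (rows) observed over 4 sites (columns).
presence = [
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 1, 0, 0],
]
print(c_score(presence))             # (4 + 1 + 1) / 3 = 2.0
print(checkerboard_pairs(presence))  # 1: the first two species never co-occur
```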
37
Joint information (correlation)
Agent i can execute one of its actions $a_{i,1}, \dots, a_{i,n_i}$ with probability $p_{i,j}$. (In case the probabilities are unknown they can be estimated experimentally from observation.)
A measure of the agent behavior is its Shannon entropy
$H(a_i) = -\sum_{j=1}^{n_i} p_{i,j} \log p_{i,j}$.
The definition can be extended to multiple agents (accounting for allpossible combinations of chosen actions).
38
Joint information (correlation)
Then, for example, the entropy for a two-agent system is
$H(a_1, a_2) = -\sum_{j=1}^{n_1 n_2} p_j \log p_j$,
where $p_j$ is the probability of the j-th combination of the two agents’ actions.
The difference
$I(a_1 : a_2) = H(a_1) + H(a_2) - H(a_1, a_2)$
is called correlation or joint information.
When the agents act statistically independently it is equal to zero: no information is shared, and there is no correlation. When their actions are synchronized, the correlation is at its maximum.
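The two limiting cases can be checked numerically. The sketch below estimates the probabilities from observed action pairs, as suggested earlier for unknown probabilities; the base-2 logarithm (entropy in bits) and the function names are assumptions, since the slides do not fix the base.

```python
from collections import Counter
from math import log2

def entropy(probabilities):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * log2(p) for p in probabilities if p > 0)

def empirical_entropy(samples):
    """Entropy of the empirical distribution of a list of outcomes."""
    counts = Counter(samples)
    return entropy(c / len(samples) for c in counts.values())

def joint_information(observations):
    """I(a1 : a2) = H(a1) + H(a2) - H(a1, a2) from observed action pairs."""
    first = [a for a, _ in observations]
    second = [b for _, b in observations]
    return (empirical_entropy(first) + empirical_entropy(second)
            - empirical_entropy(observations))

synchronized = [("L", "L"), ("R", "R")] * 50
independent = [("L", "L"), ("L", "R"), ("R", "L"), ("R", "R")] * 25
print(joint_information(synchronized))  # 1 bit: knowing one action gives the other
print(joint_information(independent))   # 0 bits: no shared information
```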
39
Joint information in minority game
(Left) Inefficiency (inverse performance) vs. size of the strategy space,(Right) Joint information in the Minority game
[parunak2003preliminary].
40
Data structures for simplicial complexes
When one has to write code with simplicial complex data, say in computational topology, one of the questions is how to efficiently represent simplicial complexes in memory, so there is a need for good data structures.
One possibility is to store only the facets of the simplicial complex,and this may save a lot of space, because we don’t have to enumerateall the faces, when there may be exponentially many of them withrespect to the number of vertices.
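A facets-only representation can be as simple as a list of vertex sets, with a face test that scans the facets. This is a minimal sketch with a hypothetical function name; the linear scan is the price paid for the compact storage.

```python
def is_face(facets, simplex):
    """A non-empty vertex set is a face iff it is contained in some facet."""
    s = frozenset(simplex)
    return bool(s) and any(s <= f for f in facets)

# Two facets: one 19-dimensional simplex on vertices 0..19 and one edge.
facets = [frozenset(range(20)), frozenset({19, 20})]
print(is_face(facets, {3, 7, 11}))  # True
print(is_face(facets, {0, 20}))     # False
print(2 ** 20 - 1)                  # number of faces implied by the first facet alone
```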
If one wants to have more control, they can store all the faces of thecomplex. One possibility then is to store the Hasse diagram of thecomplex. This approach is used by the libraries JPlex([sexton_jplex_2008]) and Dionysus, among others.
41
Simplex tree data structure
[boissonnat2014simplex] propose to use a trie data structure to store all faces, each face being represented by a sorted list of its vertices. Additionally, all nodes with the same label at the same depth of the trie are connected via a doubly-linked list.
42
Simplex tree data structure
Each node of the trie can be implemented either as a hash table or as a red-black tree. Assuming the maximum degree of a vertex in the 1-skeleton (graph) of the simplicial complex is deg(T), the time complexity of searching, adding, or removing a branch of the node is amortized O(1) in the former case, and O(log(deg(T))) in the latter case. However, using the red-black tree allows traversing all of a node’s branches in sorted order.
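The trie idea can be sketched with nested Python dictionaries playing the role of the hash-table nodes. This is a simplified illustration, not the implementation of [boissonnat2014simplex]: the class and method names are made up, the doubly-linked lists between equal labels are omitted, and cofaces are found by a full traversal rather than through those lists.

```python
from itertools import combinations

class SimplexTree:
    """Trie of sorted vertex lists; every trie node represents one face."""

    def __init__(self):
        self.root = {}  # label -> child dict

    def _insert_word(self, word):
        node = self.root
        for label in word:
            node = node.setdefault(label, {})

    def insert(self, simplex):
        """Insert a simplex together with all of its faces."""
        vs = sorted(simplex)
        for k in range(1, len(vs) + 1):
            for face in combinations(vs, k):
                self._insert_word(face)

    def contains(self, simplex):
        """Walk down the trie along the sorted labels of the simplex."""
        node = self.root
        for label in sorted(simplex):
            if label not in node:
                return False
            node = node[label]
        return True

    def simplices(self, node=None, prefix=()):
        """Yield all faces in depth-first, label-sorted order."""
        node = self.root if node is None else node
        for label in sorted(node):
            yield prefix + (label,)
            yield from self.simplices(node[label], prefix + (label,))

    def cofaces(self, simplex):
        """Proper cofaces by brute-force traversal (no linked lists here)."""
        target = set(simplex)
        return [s for s in self.simplices() if target < set(s)]

tree = SimplexTree()
tree.insert([1, 2, 3])
tree.insert([3, 4])
print(list(tree.simplices()))  # 9 faces: 7 from the triangle plus (3, 4) and (4,)
print(tree.contains([2, 3]), tree.contains([2, 4]))  # True False
print(tree.cofaces([3]))       # [(1, 2, 3), (1, 3), (2, 3), (3, 4)]
```

Sorting the keys at each node during traversal stands in for the ordered iteration that a red-black tree node would provide.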
43
Operations for simplex tree
• Insert a simplex. To add a simplex $[l_0, \dots, l_j]$ together with its faces, one has to add up to $2^{j+1} - 1$ nodes to the tree (one per non-empty face), which yields time complexity $O(2^j D_m)$, where $D_m$ denotes the cost of one dictionary operation at a node of the trie.
• Locate all cofaces of a simplex σ in the complex X. A coface of σ is a face of the complex that contains σ. Let $\sigma = [l_0, \dots, l_j]$. Using the linked lists connecting all occurrences of the last label $l_j$ of the simplex, we look at all occurrences of this label at depth greater than j in the tree. Each such node can potentially represent a coface, and we have to check each of them by traversing the tree up to the root. (If it is indeed a coface then all its descendant nodes will be cofaces as well.) If there are $T^{>j}_{l_j}$ such occurrences of $l_j$ and the dimension of the complex is k (and so the depth of the tree is at most k), then each occurrence of the label $l_j$ takes O(k) time to check. Thus in total it takes $O(k T^{>j}_{l_j})$ time.
44
Operations for simplex tree
• Remove a simplex. Based on the previous operation. Remove the node representing the simplex and remove all its cofaces.
• Locate maximal proper faces of the simplex σ. For $\sigma = [l_0, \dots, l_j]$, each of its maximal proper faces is equal to σ minus one of its vertices, so there are j + 1 such maximal proper faces. Starting at the node representing $[l_0, \dots, l_j]$, to locate each of $[l_0, \dots, l_{i-1}, l_{i+1}, \dots, l_j]$, we traverse towards the root of the tree until we reach $[l_0, \dots, l_{i-1}]$ in O(j) time, and then traverse down the tree through the labels $l_{i+1}, \dots, l_j$, which takes time $O(j D_m)$. Since there are j + 1 such faces that must be located, the total time complexity is $O(j^2 D_m)$.
45
Operations for simplex tree
The data structure allows several topology-preserving operations that can be convenient for simplifying the complex without altering its homology properties. In particular, it provides the elementary collapse and edge contraction operations.
Experimental results. Statistics and timing for construction of Vietoris-Rips complexes on two different datasets ([boissonnat2014simplex]).
46
Future directions
• Measuring strength of networks of teams modeled as a simplicial complex.
• Wikipedia talk pages and editors’ collaboration.
• Simplicial complex of the source code call graph.
• Information propagation in games and teamwork. Wildcat wells game. Finding where higher-order interactions matter the most.
47